Image Credit: wagg66 from FreeImages.
Walmart of Mexico (or WALMEX) is one of the most important retail companies in the region, with 3,903 stores across Mexico and Central America, equity of 199,086,037 MXN, and yearly revenue of 880,121,761 MXN, according to figures from December 2023. According to WALMEX's latest financial report, its goal is to double its sales within a period of 10 years (Wal-Mart de México S.A.B. de C.V., 2024).
A time series is "a set of data points ordered in time" (Peixeiro, 2022), which can be analyzed to calculate forecasts and gain valuable insights (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
Univariate time series analysis is the most common approach (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023), by means of models such as Moving Average (MA), Autoregressive Moving Average (ARMA), Autoregressive Integrated Moving Average (ARIMA), or Simple Exponential Smoothing, which depend solely on time and the variable under study.
On the other hand, it is also possible to forecast time series using regression-based modeling, in which other variables or features are used to predict the response variable (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). This approach can have the advantage of quantifying the impact of external economic indicators on the performance of an organization.
In the case of Mexico, it is possible to collect public data from government offices such as INEGI and BANXICO, or from international sources such as the S&P 500 index, and to assess how these indicators correlate with revenue.
In this context, it is desirable to explore both approaches to predict WALMEX net sales over the coming years. Thus, the purpose of the present project is to forecast WALMEX net sales and then use that information to predict whether WALMEX will be able to achieve its long-term goal of doubling its sales within the next ten years.
To predict whether Walmart of Mexico will double its sales within the next ten years.
Will Walmart of Mexico be able to double its sales within the next ten years? If so, when?
Walmart de México will manage to double its sales within the next ten years.
The methodology of the present study is based on the CRISP-DM (Chapman et al., 2000) framework and Rollin’s Foundational Methodology for Data Science (Rollins, 2015):
Data modeling: Ten univariate time series models were built and assessed in Python 3 and its libraries Statsmodels and Prophet to predict the net sales of WALMEX:
Prophet Univariate Time Series Modeling.
Then, three vector models were created and trained in Python 3 and its libraries Statsmodels and Darts to predict the values of the selected macroeconomic indicators as a multivariate time series:
Vector Autoregressive Integrated Moving Average (VARIMA) model
After that, two regression models were built using Random Forests and Support Vector Machines in Python 3 and its library Scikit-learn to predict WALMEX total sales based on the predictions from the best-performing multivariate time series model.
All the models were fitted on a training set with 80% of the data and assessed on a testing set with the remaining 20%. The Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the Coefficient of Determination $(r^{2})$ were used for model assessment.
Evaluation: The different models were ranked in terms of the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE), and the Coefficient of Determination $(r^{2})$. The best univariate, multivariate, and regression models were selected, retrained on all the historical data, and used to predict whether Walmart of Mexico would be able to double its sales within the next ten years.
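As a minimal, hedged sketch of this assessment scheme (with a synthetic series, not the actual WALMEX data), the chronological 80/20 split and the three scores can be computed as follows:

```python
# Minimal sketch of the assessment scheme: chronological 80/20 split and the
# RMSE, MAE, and r^2 scores. The series here is synthetic, not the WALMEX data.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y = np.arange(40, dtype=float)           # stand-in for the 40 quarterly observations
split = int(len(y) * 0.8)                # first 80% for training; no shuffling in time series
y_train, y_test = y[:split], y[split:]

y_pred = y_test + 1.0                    # stand-in forecast, off by one unit
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.2f}, MAE={mae:.2f}, r2={r2:.2f}")
```

Note that the split is chronological rather than random, so the testing set always covers the most recent quarters.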
In this context, the purpose of the present notebook is to collect, explore, prepare, model and evaluate the data from the different sources.
# Loading Requirements Text File
# %pip install -r requirements.txt
# Libraries installation
# %pip install ipykernel
# %pip install numpy
# %pip install pandas
# %pip install yfinance
# %pip install openpyxl
# %pip install matplotlib
# %pip install seaborn
# %python -m pip install statsmodels
# %pip install -U scikit-learn
# %pip install prophet
# %pip install darts
# Libraries importation
import numpy as np
import pandas as pd
import yfinance as yf
import json
from urllib.request import urlopen
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib_inline.backend_inline
from matplotlib.lines import Line2D
import seaborn as sns
import warnings
from itertools import product
from itertools import combinations_with_replacement
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.api import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from prophet import Prophet
from prophet.diagnostics import cross_validation
from prophet.diagnostics import performance_metrics
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.tsa.statespace.varmax import VARMAX
from darts.models.forecasting.varima import VARIMA
from darts import TimeSeries
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
# Setting theme and plot resolution
sns.set_theme(context = 'notebook', style = 'darkgrid')
mpl.rcParams["figure.dpi"] = 100
mpl.rcParams["savefig.dpi"] = 300
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
# Setting default plot's aesthetics
plotfontcolor = 'dimgray'
mpl.rcParams['text.color'] = plotfontcolor
mpl.rcParams['axes.labelcolor'] = plotfontcolor
mpl.rcParams['xtick.color'] = plotfontcolor
mpl.rcParams['ytick.color'] = plotfontcolor
mpl.rcParams["font.size"] = 10
mpl.rcParams['axes.titlesize'] = 14
mpl.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 11
mpl.rcParams["axes.labelweight"] = "bold"
mpl.rcParams["axes.titleweight"] = "bold"
#mpl.rcParams["legend.facecolor"] = "lightgray"
#mpl.rcParams["legend.edgecolor"] = "gray"
#mpl.rcParams["legend.framealpha"] = 0.8
#mpl.rcParams['font.family'] = 'sans-serif'
#mpl.rcParams['font.family'] = 'serif'
# Disabling warnings
warnings.filterwarnings('ignore')
# Color for plots
time_series_color = sns.color_palette('Blues_r')[0]
pred_color = '#C70039'
Firstly, WALMEX sales figures were retrieved from WALMEX's investors website. As the financial data was disclosed on a quarterly basis in 40 PDF files hosted at a variety of inconsistent links, it was decided, for the sake of efficiency, to collect the data manually and consolidate it into an Excel file: the number of files was sizeable, and the complexity of developing a script to scrape and parse each file was too high just to retrieve the Net sales account.
It is important to note that the net sales figures for WALMEX are in millions of Mexican pesos (mdp, for its acronym in Spanish).
# Loading sales data
link = 'https://github.com/DanielEduardoLopez/SalesPrediction/raw/main/Walmex_Quarterly_Net_Sales.xlsx'
sales_df = pd.read_excel(link)
sales_df.head()
| | Year | Quarter | Net Sales (mdp) | Units | Sq.mt. Mexico | Sq.mt. Central America |
|---|---|---|---|---|---|---|
| 0 | 2014 | Q1 | 101405 | 2867 | NaN | NaN |
| 1 | 2014 | Q2 | 103300 | 2879 | NaN | NaN |
| 2 | 2014 | Q3 | 104367 | 2904 | NaN | NaN |
| 3 | 2014 | Q4 | 128586 | 2983 | NaN | NaN |
| 4 | 2015 | Q1 | 110875 | 2987 | NaN | NaN |
Then, the stock close values of WALMEX, and the index values from the IPC and the S&P500 were retrieved from the API of Yahoo! Finance through the library yfinance.
# Retrieving market data for WALMEX, IPC and S&P500
stocks = ["WALMEX.MX", "^MXX", "^GSPC"]
stocks_df = yf.download(stocks, start="2014-02-01", end="2024-02-01")['Close']
stocks_df.head()
| Ticker | WALMEX.MX | ^GSPC | ^MXX |
|---|---|---|---|
| Date | |||
| 2014-02-03 | NaN | 1741.890015 | NaN |
| 2014-02-04 | 31.500000 | 1755.199951 | 40085.519531 |
| 2014-02-05 | 31.379999 | 1751.640015 | 39880.871094 |
| 2014-02-06 | 31.190001 | 1773.430054 | 40288.781250 |
| 2014-02-07 | 29.850000 | 1797.020020 | 40525.738281 |
The currency of the WALMEX stock value is Mexican pesos (MXN).
Later, the GDP and inflation data were retrieved from INEGI's Query Constructor, saving the response JSON files to disk and then loading them into the notebook.
It's important to note that GDP values are in millions of Mexican pesos at 2018 prices.
# Loading GDP data
gdp_link = "https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/gdp_data.json"
response = urlopen(gdp_link)
gdp_json = json.loads(response.read())
# Retrieving series
gdp_dict = gdp_json["Series"][0]["OBSERVATIONS"]
# Converting dict into a dataframe
gdp_df = pd.DataFrame.from_dict(gdp_dict, orient='columns')
gdp_df.head()
| | TIME_PERIOD | OBS_VALUE | OBS_EXCEPTION | OBS_STATUS | OBS_SOURCE | OBS_NOTE | COBER_GEO |
|---|---|---|---|---|---|---|---|
| 0 | 2023/04 | 25596360.563 | 1 | 17 | None | 00 | |
| 1 | 2023/03 | 25121269.269 | 17 | None | 00 | ||
| 2 | 2023/02 | 25001939.978 | 17 | None | 00 | ||
| 3 | 2023/01 | 24291956.149 | 17 | None | 00 | ||
| 4 | 2022/04 | 24981146.477 | 17 | None | 00 |
It is noteworthy that the GDP values are sorted in descending order by time period, which is inconsistent with the other datasets; so, the GDP dataset was sorted in ascending order.
gdp_df = gdp_df.sort_values(by='TIME_PERIOD', ascending=True).reset_index(drop=True)
gdp_df.head()
| | TIME_PERIOD | OBS_VALUE | OBS_EXCEPTION | OBS_STATUS | OBS_SOURCE | OBS_NOTE | COBER_GEO |
|---|---|---|---|---|---|---|---|
| 0 | 1980/01 | 10401367.607 | 17 | None | 00 | ||
| 1 | 1980/02 | 10342350 | 17 | None | 00 | ||
| 2 | 1980/03 | 10392733.012 | 17 | None | 00 | ||
| 3 | 1980/04 | 10927666.353 | 17 | None | 00 | ||
| 4 | 1981/01 | 11345848.491 | 17 | None | 00 |
Finally, the MXN/USD exchange rates, the bond interest rates (CETES 28), and the money market representative interest rates (28-day TIIE) were retrieved from Banxico's website in the form of CSV files.
# Loading MXN/USD exchange rates data
exchange_rates_link = 'https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/exchange_rates_data.csv'
exchange_rates_df = pd.read_csv(exchange_rates_link, encoding = "ISO-8859-1")
exchange_rates_df.head(15)
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | Exchange rates and auctions historical informa... | NaN |
| 2 | Tipos de cambio diarios | NaN |
| 3 | NaN | NaN |
| 4 | Date: 03/02/2024 06:14:08 | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | Title | Exchange rate pesos per US dollar, Used to set... |
| 9 | Information type | Levels |
| 10 | Date | SF60653 |
| 11 | 01/01/2014 | 13.0652 |
| 12 | 01/02/2014 | 13.0652 |
| 13 | 01/03/2014 | 13.0843 |
| 14 | 01/04/2014 | 13.1011 |
# Loading bonds interest rates data
bonds_rates_link = 'https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/bonds_rates_data.csv'
bonds_rates_df = pd.read_csv(bonds_rates_link, encoding = "ISO-8859-1")
bonds_rates_df.head(15)
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | Securities auctions and Open Market Operations | NaN |
| 2 | Valores Gubernamentales | NaN |
| 3 | NaN | NaN |
| 4 | Date: 03/02/2024 06:05:53 | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | Title | Bonds Issued by Public Agencies Weekly auction... |
| 9 | Information type | Levels |
| 10 | Date | SF43936 |
| 11 | 01/02/2014 | 3.16 |
| 12 | 01/09/2014 | 3.17 |
| 13 | 01/16/2014 | 3.05 |
| 14 | 01/23/2014 | 3.13 |
# Loading money market interest rates data
interest_rates_link = 'https://raw.githubusercontent.com/DanielEduardoLopez/SalesPrediction/main/interest_rates_data.csv'
interest_rates_df = pd.read_csv(interest_rates_link, encoding = "ISO-8859-1")
interest_rates_df.head(15)
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | Securities prices and interest rates | NaN |
| 2 | Tasas de Interés en el Mercado de Dinero | NaN |
| 3 | NaN | NaN |
| 4 | Date: 03/02/2024 06:10:39 | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | Title | 28 day TIIE, Interest rate in annual percent |
| 9 | Information type | Levels |
| 10 | Date | SF43783 |
| 11 | 01/02/2014 | 3.795 |
| 12 | 01/03/2014 | 3.795 |
| 13 | 01/06/2014 | 3.7925 |
| 14 | 01/07/2014 | 3.7962 |
In this section, the collected datasets were explored to describe their attributes, number of records, shapes, and data types; to assess their quality in terms of the percentage of missing values and suspected extreme outliers; and to perform an initial exploration of statistical measures, time trends, distributions, and relationships among the variables.
# Preview of the dataset
sales_df.head()
| | Year | Quarter | Net Sales (mdp) | Units | Sq.mt. Mexico | Sq.mt. Central America |
|---|---|---|---|---|---|---|
| 0 | 2014 | Q1 | 101405 | 2867 | NaN | NaN |
| 1 | 2014 | Q2 | 103300 | 2879 | NaN | NaN |
| 2 | 2014 | Q3 | 104367 | 2904 | NaN | NaN |
| 3 | 2014 | Q4 | 128586 | 2983 | NaN | NaN |
| 4 | 2015 | Q1 | 110875 | 2987 | NaN | NaN |
sales_df.tail()
| | Year | Quarter | Net Sales (mdp) | Units | Sq.mt. Mexico | Sq.mt. Central America |
|---|---|---|---|---|---|---|
| 35 | 2022 | Q4 | 236272 | 3745 | 6655421.0 | 815407.0 |
| 36 | 2023 | Q1 | 204601 | 3755 | 6659477.0 | 818300.0 |
| 37 | 2023 | Q2 | 212164 | 3775 | 6688133.0 | 818300.0 |
| 38 | 2023 | Q3 | 211436 | 3802 | 6706432.0 | 819264.0 |
| 39 | 2023 | Q4 | 251921 | 3903 | 6831809.0 | 821822.0 |
# Basic info of the dataset
sales_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Year                    40 non-null     int64
 1   Quarter                 40 non-null     object
 2   Net Sales (mdp)         40 non-null     int64
 3   Units                   40 non-null     int64
 4   Sq.mt. Mexico           27 non-null     float64
 5   Sq.mt. Central America  27 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 2.0+ KB
# Basic statistical description
sales_df.describe()
| | Year | Net Sales (mdp) | Units | Sq.mt. Mexico | Sq.mt. Central America |
|---|---|---|---|---|---|
| count | 40.000000 | 40.000000 | 40.000000 | 2.700000e+01 | 27.000000 |
| mean | 2018.500000 | 160831.950000 | 3311.700000 | 6.393321e+06 | 780801.370370 |
| std | 2.908872 | 37308.578943 | 300.161512 | 2.190696e+05 | 39569.845038 |
| min | 2014.000000 | 101405.000000 | 2867.000000 | 6.035180e+06 | 689820.000000 |
| 25% | 2016.000000 | 130929.000000 | 3041.500000 | 6.203962e+06 | 753962.000000 |
| 50% | 2018.500000 | 160334.500000 | 3254.000000 | 6.403496e+06 | 799486.000000 |
| 75% | 2021.000000 | 188321.250000 | 3545.750000 | 6.553819e+06 | 807843.500000 |
| max | 2023.000000 | 251921.000000 | 3903.000000 | 6.831809e+06 | 821822.000000 |
# Shape of dataset
sales_df.shape
(40, 6)
# Preview of dataset
stocks_df.head()
| Ticker | WALMEX.MX | ^GSPC | ^MXX |
|---|---|---|---|
| Date | |||
| 2014-02-03 | NaN | 1741.890015 | NaN |
| 2014-02-04 | 31.500000 | 1755.199951 | 40085.519531 |
| 2014-02-05 | 31.379999 | 1751.640015 | 39880.871094 |
| 2014-02-06 | 31.190001 | 1773.430054 | 40288.781250 |
| 2014-02-07 | 29.850000 | 1797.020020 | 40525.738281 |
stocks_df.tail()
| Ticker | WALMEX.MX | ^GSPC | ^MXX |
|---|---|---|---|
| Date | |||
| 2024-01-25 | 69.279999 | 4894.160156 | 56160.070312 |
| 2024-01-26 | 70.699997 | 4890.970215 | 56855.878906 |
| 2024-01-29 | 70.349998 | 4927.930176 | 57175.730469 |
| 2024-01-30 | 71.919998 | 4924.970215 | 57537.140625 |
| 2024-01-31 | 71.089996 | 4845.649902 | 57372.761719 |
# Basic info
stocks_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2584 entries, 2014-02-03 to 2024-01-31
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   WALMEX.MX  2515 non-null   float64
 1   ^GSPC      2516 non-null   float64
 2   ^MXX       2514 non-null   float64
dtypes: float64(3)
memory usage: 80.8 KB
# Basic statistical description
stocks_df.describe()
| Ticker | WALMEX.MX | ^GSPC | ^MXX |
|---|---|---|---|
| count | 2515.000000 | 2516.000000 | 2514.000000 |
| mean | 52.796288 | 3030.773772 | 46462.182669 |
| std | 13.482746 | 909.474790 | 4687.222880 |
| min | 28.059999 | 1741.890015 | 32964.218750 |
| 25% | 42.039999 | 2138.630066 | 43406.756836 |
| 50% | 52.689999 | 2810.109985 | 46047.953125 |
| 75% | 65.530003 | 3930.392578 | 49939.346680 |
| max | 81.919998 | 4927.930176 | 57745.789062 |
# Shape of dataset
stocks_df.shape
(2584, 3)
So, the stocks dataset comprises 2,584 daily observations of the close values of WALMEX.MX, ^GSPC (S&P 500), and ^MXX (IPC).
# Preview of the dataset
gdp_df.head()
| | TIME_PERIOD | OBS_VALUE | OBS_EXCEPTION | OBS_STATUS | OBS_SOURCE | OBS_NOTE | COBER_GEO |
|---|---|---|---|---|---|---|---|
| 0 | 1980/01 | 10401367.607 | 17 | None | 00 | ||
| 1 | 1980/02 | 10342350 | 17 | None | 00 | ||
| 2 | 1980/03 | 10392733.012 | 17 | None | 00 | ||
| 3 | 1980/04 | 10927666.353 | 17 | None | 00 | ||
| 4 | 1981/01 | 11345848.491 | 17 | None | 00 |
gdp_df.tail()
| | TIME_PERIOD | OBS_VALUE | OBS_EXCEPTION | OBS_STATUS | OBS_SOURCE | OBS_NOTE | COBER_GEO |
|---|---|---|---|---|---|---|---|
| 171 | 2022/04 | 24981146.477 | 17 | None | 00 | ||
| 172 | 2023/01 | 24291956.149 | 17 | None | 00 | ||
| 173 | 2023/02 | 25001939.978 | 17 | None | 00 | ||
| 174 | 2023/03 | 25121269.269 | 17 | None | 00 | ||
| 175 | 2023/04 | 25596360.563 | 1 | 17 | None | 00 |
# Basic info of the dataset
gdp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   TIME_PERIOD    176 non-null    object
 1   OBS_VALUE      176 non-null    object
 2   OBS_EXCEPTION  176 non-null    object
 3   OBS_STATUS     176 non-null    object
 4   OBS_SOURCE     176 non-null    object
 5   OBS_NOTE       0 non-null      object
 6   COBER_GEO      176 non-null    object
dtypes: object(7)
memory usage: 9.8+ KB
# Basic statistical description
gdp_df.describe()
| | TIME_PERIOD | OBS_VALUE | OBS_EXCEPTION | OBS_STATUS | OBS_SOURCE | OBS_NOTE | COBER_GEO |
|---|---|---|---|---|---|---|---|
| count | 176 | 176 | 176 | 176 | 176 | 0 | 176 |
| unique | 176 | 176 | 1 | 3 | 1 | 0 | 1 |
| top | 1980/01 | 10401367.607 | 17 | NaN | 00 | ||
| freq | 1 | 1 | 176 | 174 | 176 | NaN | 176 |
# Shape of dataset
gdp_df.shape
(176, 7)
# Preview of the dataset
exchange_rates_df.head(10)
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | Exchange rates and auctions historical informa... | NaN |
| 2 | Tipos de cambio diarios | NaN |
| 3 | NaN | NaN |
| 4 | Date: 03/02/2024 06:14:08 | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | Title | Exchange rate pesos per US dollar, Used to set... |
| 9 | Information type | Levels |
# Preview of the dataset
exchange_rates_df.tail()
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 3723 | 03/01/2024 | 17.0962 |
| 3724 | 03/02/2024 | 17.0633 |
| 3725 | 03/03/2024 | 17.0633 |
| 3726 | 03/04/2024 | 17.0633 |
| 3727 | 03/05/2024 | 17.0217 |
# Basic info of the dataset
exchange_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3728 entries, 0 to 3727
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Banco de México  3723 non-null   object
 1   Unnamed: 1       3720 non-null   object
dtypes: object(2)
memory usage: 58.4+ KB
# Basic statistical description
exchange_rates_df.describe()
| | Banco de México | Unnamed: 1 |
|---|---|---|
| count | 3723 | 3720 |
| unique | 3723 | 2494 |
| top | Exchange rates and auctions historical informa... | 20.1175 |
| freq | 1 | 7 |
# Shape of dataset
exchange_rates_df.shape
(3728, 2)
# Preview of the dataset
bonds_rates_df.head(10)
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | Securities auctions and Open Market Operations | NaN |
| 2 | Valores Gubernamentales | NaN |
| 3 | NaN | NaN |
| 4 | Date: 03/02/2024 06:05:53 | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | Title | Bonds Issued by Public Agencies Weekly auction... |
| 9 | Information type | Levels |
bonds_rates_df.tail()
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 537 | 02/01/2024 | 11.15 |
| 538 | 02/08/2024 | 11.06 |
| 539 | 02/15/2024 | 11.05 |
| 540 | 02/22/2024 | 11 |
| 541 | 02/29/2024 | 11 |
# Basic info of the dataset
bonds_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 542 entries, 0 to 541
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Banco de México  537 non-null    object
 1   Unnamed: 1       534 non-null    object
dtypes: object(2)
memory usage: 8.6+ KB
bonds_rates_df.describe()
| | Banco de México | Unnamed: 1 |
|---|---|---|
| count | 537 | 534 |
| unique | 537 | 301 |
| top | Securities auctions and Open Market Operations | 3.05 |
| freq | 1 | 7 |
# Shape of dataset
bonds_rates_df.shape
(542, 2)
# Preview of the dataset
interest_rates_df.head(10)
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 0 | NaN | NaN |
| 1 | Securities prices and interest rates | NaN |
| 2 | Tasas de Interés en el Mercado de Dinero | NaN |
| 3 | NaN | NaN |
| 4 | Date: 03/02/2024 06:10:39 | NaN |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
| 7 | NaN | NaN |
| 8 | Title | 28 day TIIE, Interest rate in annual percent |
| 9 | Information type | Levels |
interest_rates_df.tail()
| | Banco de México | Unnamed: 1 |
|---|---|---|
| 2565 | 02/27/2024 | 11.4891 |
| 2566 | 02/28/2024 | 11.49 |
| 2567 | 02/29/2024 | 11.4875 |
| 2568 | 03/01/2024 | 11.4937 |
| 2569 | 03/04/2024 | 11.485 |
# Basic info of the dataset
interest_rates_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2570 entries, 0 to 2569
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Banco de México  2565 non-null   object
 1   Unnamed: 1       2562 non-null   object
dtypes: object(2)
memory usage: 40.3+ KB
interest_rates_df.describe()
| | Banco de México | Unnamed: 1 |
|---|---|---|
| count | 2565 | 2562 |
| unique | 2565 | 1730 |
| top | Securities prices and interest rates | 3.3 |
| freq | 1 | 31 |
# Shape of dataset
interest_rates_df.shape
(2570, 2)
Missing data is a common issue in real datasets. Thus, in the present section, the datasets were assessed to identify the number of missing values and their percentage.
# Function to calculate the percentage of missing values for each column in the dataset
def missing_values_summary(df):
    """
    Calculates the number of missing values and its corresponding percentage of total values
    for each column in a pandas dataframe.
    Parameters:
    df (pandas.DataFrame): Input dataset.
    Returns:
    mis_val_table_df (pandas.DataFrame): Table with the number of missing values and its corresponding percentage for each column.
    """
    mis_val = df.isnull().sum()
    mis_val_percentage = (mis_val / len(df)) * 100
    mis_val_table_df = pd.concat([pd.Series(mis_val.index), pd.Series(mis_val.values),
                                  pd.Series(mis_val_percentage.values)], axis=1)
    mis_val_table_df.columns = ['Attribute', 'Missing Values Count', 'Missing Values Rate (%)']
    mis_val_table_df = mis_val_table_df[
        mis_val_table_df.iloc[:, 1] != 0].sort_values(
        'Missing Values Rate (%)', ascending=False).round(2)
    print("Dataset has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_df.shape[0]) + " attributes that have missing values.")
    return mis_val_table_df
# Missing data in sales dataset
missing_values_summary(sales_df)
Dataset has 6 columns. There are 2 attributes that have missing values.
| | Attribute | Missing Values Count | Missing Values Rate (%) |
|---|---|---|---|
| 4 | Sq.mt. Mexico | 13 | 32.5 |
| 5 | Sq.mt. Central America | 13 | 32.5 |
# Missing data in stocks dataset
missing_values_summary(stocks_df)
Dataset has 3 columns. There are 3 attributes that have missing values.
| | Attribute | Missing Values Count | Missing Values Rate (%) |
|---|---|---|---|
| 2 | ^MXX | 70 | 2.71 |
| 0 | WALMEX.MX | 69 | 2.67 |
| 1 | ^GSPC | 68 | 2.63 |
# Missing data in GDP dataset
missing_values_summary(gdp_df)
Dataset has 7 columns. There are 1 attributes that have missing values.
| | Attribute | Missing Values Count | Missing Values Rate (%) |
|---|---|---|---|
| 5 | OBS_NOTE | 176 | 100.0 |
# Missing data in exchange rates dataset
missing_values_summary(exchange_rates_df)
Dataset has 2 columns. There are 2 attributes that have missing values.
| | Attribute | Missing Values Count | Missing Values Rate (%) |
|---|---|---|---|
| 1 | Unnamed: 1 | 8 | 0.21 |
| 0 | Banco de México | 5 | 0.13 |
# Missing data in bonds rates dataset
missing_values_summary(bonds_rates_df)
Dataset has 2 columns. There are 2 attributes that have missing values.
| | Attribute | Missing Values Count | Missing Values Rate (%) |
|---|---|---|---|
| 1 | Unnamed: 1 | 8 | 1.48 |
| 0 | Banco de México | 5 | 0.92 |
# Missing data in interest rates dataset
missing_values_summary(interest_rates_df)
Dataset has 2 columns. There are 2 attributes that have missing values.
| | Attribute | Missing Values Count | Missing Values Rate (%) |
|---|---|---|---|
| 1 | Unnamed: 1 | 8 | 0.31 |
| 0 | Banco de México | 5 | 0.19 |
So, several attributes in the datasets have missing values, the vast majority to a small extent, so the method dropna() was sufficient to handle them. However, the attributes OBS_NOTE in the GDP dataset, and Sq.mt. Mexico and Sq.mt. Central America in the sales dataset, have missing values rates higher than 30%, so they will be removed during the preparation step.
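As a hedged sketch of that preparation rule (toy data, not the actual datasets; the 30% threshold mirrors the criterion above):

```python
# Sketch: drop any column whose missing-value rate exceeds 30%, then apply
# dropna() to the remainder. Toy data, not the actual datasets.
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'Net Sales (mdp)': [101405, 103300, 104367, 128586],
    'Sq.mt. Mexico': [np.nan, np.nan, 6655421.0, 6659477.0],  # 50% missing
})

threshold = 0.30
keep = toy_df.columns[toy_df.isnull().mean() <= threshold]  # missing rate per column
clean_df = toy_df[keep].dropna()          # only fully usable columns and rows remain
print(list(clean_df.columns))
```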
Likewise, the datasets were assessed to identify any extreme outliers according to the interquartile range criterion (NIST/SEMATECH, 2012).
# Function to calculate outliers based on the interquartile range
def count_outliers(series):
    """
    Returns the number of suspected extreme outliers and its corresponding rate based on the rule of:
    +-interquartile range * 3 (NIST/SEMATECH, 2012).
    Parameters:
    series (pandas.Series): Vector of numerical data.
    Returns:
    outliers_count (int): Number of suspected extreme outliers in the input data.
    outliers_percentage (float): Rate of suspected extreme outliers in the input data.
    """
    x = series.dropna()
    q3, q1 = np.percentile(x, [75, 25])
    iqr = q3 - q1
    lower_limit = q1 - iqr * 3
    upper_limit = q3 + iqr * 3
    mask = (x > lower_limit) & (x < upper_limit)
    data_count = x.shape[0]
    outliers_count = data_count - x.loc[mask].shape[0]
    outliers_percentage = round((outliers_count / data_count) * 100, 2)
    return outliers_count, outliers_percentage

def outliers_summary(df):
    """
    Calculates the number of outliers and its corresponding percentage of total values for each numeric column.
    Parameters:
    df (pandas.DataFrame): Input dataset.
    Returns:
    outliers_table_df (pandas.DataFrame): Table with the number of suspected outliers and its corresponding rate for each column.
    """
    outliers_count_list = []
    outliers_percentage_list = []
    columns_list = list(df.select_dtypes(include='number').columns)
    for col in columns_list:
        outliers_count, outliers_percentage = count_outliers(df[col])
        outliers_count_list.append(outliers_count)
        outliers_percentage_list.append(outliers_percentage)
    outliers_dict = {'Attribute': columns_list,
                     'Outliers Count': outliers_count_list,
                     'Outliers Rate (%)': outliers_percentage_list}
    outliers_table_df = pd.DataFrame(outliers_dict)
    outliers_table_df = outliers_table_df.loc[outliers_table_df['Outliers Count'] > 0, :].sort_values(by='Outliers Count', ascending=False)
    print("Dataset has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(outliers_table_df.shape[0]) + " attributes that have suspected extreme outliers.")
    return outliers_table_df
# Suspected outliers in sales dataset
outliers_summary(sales_df)
Dataset has 6 columns. There are 0 attributes that have suspected extreme outliers.
| Attribute | Outliers Count | Outliers Rate (%) |
|---|---|---|
# Suspected outliers in stocks dataset
outliers_summary(stocks_df)
Dataset has 3 columns. There are 0 attributes that have suspected extreme outliers.
| Attribute | Outliers Count | Outliers Rate (%) |
|---|---|---|
# Suspected outliers in GDP dataset
outliers_summary(gdp_df)
Dataset has 7 columns. There are 0 attributes that have suspected extreme outliers.
| Attribute | Outliers Count | Outliers Rate (%) |
|---|---|---|
# Suspected outliers in exchange rates dataset
outliers_summary(exchange_rates_df)
Dataset has 2 columns. There are 0 attributes that have suspected extreme outliers.
| Attribute | Outliers Count | Outliers Rate (%) |
|---|---|---|
# Suspected outliers in bonds rates dataset
outliers_summary(bonds_rates_df)
Dataset has 2 columns. There are 0 attributes that have suspected extreme outliers.
| Attribute | Outliers Count | Outliers Rate (%) |
|---|---|---|
# Suspected outliers in interest rates dataset
outliers_summary(interest_rates_df)
Dataset has 2 columns. There are 0 attributes that have suspected extreme outliers.
| Attribute | Outliers Count | Outliers Rate (%) |
|---|---|---|
Thus, the datasets are free from extreme outliers.
In this section, the data was initially analyzed to calculate simple statistical measures, identify time trends, and explore distributions and relationships for each data source.
The basic statistical measures of the sales dataset are shown below:
sales_df.drop(columns=['Year', 'Quarter']).describe()
| | Net Sales (mdp) | Units | Sq.mt. Mexico | Sq.mt. Central America |
|---|---|---|---|---|
| count | 40.000000 | 40.000000 | 2.700000e+01 | 27.000000 |
| mean | 160831.950000 | 3311.700000 | 6.393321e+06 | 780801.370370 |
| std | 37308.578943 | 300.161512 | 2.190696e+05 | 39569.845038 |
| min | 101405.000000 | 2867.000000 | 6.035180e+06 | 689820.000000 |
| 25% | 130929.000000 | 3041.500000 | 6.203962e+06 | 753962.000000 |
| 50% | 160334.500000 | 3254.000000 | 6.403496e+06 | 799486.000000 |
| 75% | 188321.250000 | 3545.750000 | 6.553819e+06 | 807843.500000 |
| max | 251921.000000 | 3903.000000 | 6.831809e+06 | 821822.000000 |
On the other hand, the basic statistical measures of the stocks dataset are as follows:
stocks_df.describe()
| Ticker | WALMEX.MX | ^GSPC | ^MXX |
|---|---|---|---|
| count | 2515.000000 | 2516.000000 | 2514.000000 |
| mean | 52.796288 | 3030.773772 | 46462.182669 |
| std | 13.482746 | 909.474790 | 4687.222880 |
| min | 28.059999 | 1741.890015 | 32964.218750 |
| 25% | 42.039999 | 2138.630066 | 43406.756836 |
| 50% | 52.689999 | 2810.109985 | 46047.953125 |
| 75% | 65.530003 | 3930.392578 | 49939.346680 |
| max | 81.919998 | 4927.930176 | 57745.789062 |
The statistical measures of the GDP dataset are shown below:
gdp_df[['OBS_VALUE']].astype(float).describe()
| | OBS_VALUE |
|---|---|
| count | 1.760000e+02 |
| mean | 1.739801e+07 |
| std | 4.650498e+06 |
| min | 1.034235e+07 |
| 25% | 1.270458e+07 |
| 50% | 1.783120e+07 |
| 75% | 2.148982e+07 |
| max | 2.559636e+07 |
Finally, the statistical measures of the exchange rates, bonds rates, and interest rates datasets are shown below:
exchange_rates_df.iloc[11:,1].astype(float).to_frame().describe()
| | Unnamed: 1 |
|---|---|
| count | 3717.000000 |
| mean | 18.466568 |
| std | 2.395290 |
| min | 12.846200 |
| 25% | 17.295700 |
| 50% | 18.938500 |
| 75% | 19.991300 |
| max | 25.118500 |
bonds_rates_df.iloc[11:,1].astype(float).to_frame().describe()
| | Unnamed: 1 |
|---|---|
| count | 531.000000 |
| mean | 6.160000 |
| std | 2.591843 |
| min | 2.430000 |
| 25% | 4.015000 |
| 50% | 6.240000 |
| 75% | 7.740000 |
| max | 11.400000 |
interest_rates_df.iloc[11:,1].astype(float).to_frame().describe()
| | Unnamed: 1 |
|---|---|
| count | 2559.000000 |
| mean | 6.515711 |
| std | 2.587540 |
| min | 3.274100 |
| 25% | 4.281500 |
| 50% | 6.600000 |
| 75% | 8.115950 |
| max | 11.566900 |
def plot_linechart(df, save=True):
"""
Function to plot a line chart for each column of the input DataFrame using Seaborn.
Parameters:
df (pandas.DataFrame): Input dataset.
save (bool): Indicates whether each plot should be saved to local disk.
Returns:
None
"""
columns = list(df.columns)
for col in columns:
plt.subplots(figsize=(7,5))
sns.lineplot(data=df,
x=df.index,
y=df[col],
legend=False
)
xlabel=df.index.name.title()
ylabel=df[col].name.title()
y_ticks = plt.gca().get_yticks()
plt.gca().set_yticklabels([f'{x:,.0f}' for x in y_ticks])
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(f'{ylabel} by {xlabel}')
if save:
plt.savefig(f'Images/fig_{ylabel.lower().replace("^", "").replace(".", "").replace(" ", "_")}_over_time.png',
bbox_inches = 'tight')
The time trends of the sales dataset are shown below:
plot_linechart(sales_df.drop(columns=['Year', 'Quarter']).rename_axis('Date'))
The net sales of WALMEX exhibit a positive trend with seasonality. The number of units and the commercial area in both Mexico and Central America also show a positive trend.
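The trend-plus-seasonality reading can be checked numerically. The sketch below is illustrative, using a synthetic quarterly series with the same rough scale as the net sales data (not the actual WALMEX figures): a 4-quarter rolling mean isolates the trend, and the average deviation per quarter exposes the seasonal effect.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series with a linear trend and a Q4 spike,
# standing in for the WALMEX net sales column (illustrative only).
idx = pd.period_range('2014Q1', '2023Q4', freq='Q')
trend = np.linspace(100_000, 250_000, len(idx))
seasonal = np.where(idx.quarter == 4, 25_000, 0)
sales = pd.Series(trend + seasonal, index=idx, name='net_sales')

# A 4-quarter rolling mean smooths the seasonal spike away,
# leaving an estimate of the underlying trend.
trend_estimate = sales.rolling(window=4).mean()

# Average deviation from the trend per quarter reveals the Q4 effect.
seasonal_effect = (sales - trend_estimate).groupby(sales.index.quarter).mean()
```

On the real data, the same rolling-mean deviation would quantify how much of each Q4 peak is seasonal rather than trend.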
The time trends of the stocks dataset are plotted below:
plot_linechart(stocks_df)
The stock values of WALMEX and the S&P500 show a strong positive trend, whereas the Mexican IPC shows a weak one.
Moreover, the time trend of the GDP dataset is shown below:
chart_title = 'GDP Over Time'
plot_linechart(gdp_df[['OBS_VALUE']].astype(float).rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('GDP (million of MXN)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
The Mexican GDP showed a positive trend over the last 10 years, with a sharp temporary decrease during the COVID-19 pandemic.
The time trend of the exchange rates dataset is shown below:
chart_title = 'Exchange Rates Over Time'
plot_linechart(exchange_rates_df.iloc[11:,1].astype(float).to_frame().rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('Exchange Rate (MXN/USD)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
The exchange rate showed a weak positive trend over the last ten years.
The time trend of the bonds rates dataset is shown below:
chart_title = 'Bond Rates Over Time'
plot_linechart(bonds_rates_df.iloc[11:,1].astype(float).to_frame().rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('Bond Rates (%)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
The time trend of the interest rates dataset is shown below:
chart_title = 'Interest Rates Over Time'
plot_linechart(interest_rates_df.iloc[11:,1].astype(float).to_frame().rename_axis('Date'))
plt.title(chart_title)
plt.xlabel('Time')
plt.ylabel('Interest Rates (%)')
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Both the bonds rates and the money market interest rates showed no consistent trend, as those rates are set by the Central Bank of Mexico according to its contractionary monetary policy.
The distributions of the variables in the sales dataset are shown below by means of histograms:
sns.histplot(sales_df['Net Sales (mdp)'])
plt.title('Distribution of Walmex Net Sales')
plt.xlabel('Net Sales (million of MXN)')
plt.ylabel('Count')
plt.show()
The net sales of WALMEX show a roughly normal distribution, somewhat skewed to the right.
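Impressions such as "somewhat skewed to the right" can be quantified with the sample skewness. A minimal sketch on synthetic data (a lognormal stand-in for a right-skewed series on roughly the scale of net sales, plus a symmetric reference; not the actual figures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Lognormal stand-in for a right-skewed series (illustrative values),
# plus a symmetric normal reference series.
right_skewed = pd.Series(rng.lognormal(mean=12, sigma=0.25, size=1000))
symmetric = pd.Series(rng.normal(0, 1, size=1000))

# Skewness > 0 indicates a longer right tail; near 0, rough symmetry.
skew_right = right_skewed.skew()
skew_sym = symmetric.skew()
```

Applying `.skew()` to each column of the real datasets would put a number on every skewness claim in this section.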
The distributions of the variables of the stocks dataset are shown below by means of histograms:
sns.histplot(stocks_df['WALMEX.MX'])
plt.title('Distribution of WALMEX Stock Price')
plt.xlabel('Stock Price (MXN)')
plt.ylabel('Count')
plt.show()
The stock value for WALMEX seems to show three different distributions.
sns.histplot(stocks_df['^GSPC'])
plt.title('Distribution of S&P 500 Stock Price')
plt.xlabel('Stock Price (USD)')
plt.ylabel('Count')
plt.show()
The stock value of the S&P500 is skewed to the right.
sns.histplot(stocks_df['^MXX'])
plt.title('Distribution of IPC Stock Price')
plt.xlabel('Stock Price (MXN)')
plt.ylabel('Count')
plt.show()
The stock value of the IPC exhibits a normal distribution.
sns.histplot(sales_df['Units'])
plt.title('Distribution of Walmex Units')
plt.xlabel('Units')
plt.ylabel('Count')
plt.show()
The units data of WALMEX seem to follow a roughly uniform distribution.
sns.histplot(sales_df['Sq.mt. Mexico'])
plt.title('Distribution of Walmex Commercial Area in Mexico')
plt.xlabel(r'Area ($m^{2}$)')
plt.ylabel('Count')
plt.show()
The commercial area for Mexico only roughly resembles a normal distribution.
sns.histplot(sales_df['Sq.mt. Central America'])
plt.title('Distribution of Walmex Commercial Area in Central America')
plt.xlabel(r'Area ($m^{2}$)')
plt.ylabel('Count')
plt.show()
The commercial area data for Central America is skewed to the left.
The distribution of the GDP dataset is shown below:
# Exploring distribution with histogram
sns.histplot(gdp_df[['OBS_VALUE']].astype(float))
plt.title('Distribution of Historic Values of GDP in Mexico')
plt.xlabel('GDP (million of MXN)')
plt.ylabel('Count')
plt.legend('')
plt.show()
The distribution of the GDP is skewed to the right.
The distribution of the exchange rates dataset is shown below:
# Exploring distribution with histogram
sns.histplot(exchange_rates_df.iloc[11:,1].astype(float).to_frame())
plt.title('Distribution of Historic Exchange Rates in Mexico')
plt.xlabel('Rate (%)')
plt.ylabel('Count')
plt.legend('')
plt.show()
The exchange rates dataset shows two distributions. The first distribution (at the left of the histogram) is skewed to the right, and the second one (at the center of the histogram) resembles a normal distribution.
The distribution of the bonds rates dataset is shown below:
# Exploring distribution with histogram
sns.histplot(bonds_rates_df.iloc[11:,1].astype(float).to_frame())
plt.title('Distribution of Historic Bond Rates in Mexico')
plt.xlabel('Rate (%)')
plt.ylabel('Count')
plt.legend('')
plt.show()
No distribution is noticeable in the bonds rates dataset.
The distribution of the interest rates dataset is shown below:
# Exploring distribution with histogram
sns.histplot(interest_rates_df.iloc[11:,1].astype(float).to_frame())
plt.title('Distribution of Historic Interest Rates in Mexico')
plt.xlabel('Rate (%)')
plt.ylabel('Count')
plt.legend('')
plt.show()
Likewise, no distribution is noticeable in the money market interest rates dataset.
The distributions and variable relationships of the stocks dataset are shown below by means of a pairplot:
# Exploring distributions and correlations with pairplot
sns.pairplot(stocks_df)
plt.show()
The stock values of WALMEX and the S&P500 exhibited a strong positive correlation, whereas a weak positive correlation was observed between the stock value of WALMEX and the Mexican IPC.
A very weak positive correlation was seen between the S&P500 and the IPC.
The distributions and variable relationships of the sales dataset are shown below by means of a pairplot:
# Exploring distributions and correlations with pairplot
sns.pairplot(sales_df.drop(columns=['Year', 'Quarter']))
A strong positive correlation was observed among net sales, units, and the commercial areas in Mexico and Central America.
Then, as both are quarterly-reported data, net sales and GDP were plotted in a scatter plot to explore their relationship.
sns.regplot(x=sales_df["Net Sales (mdp)"].values, y=gdp_df['OBS_VALUE'].astype(float).iloc[-40:].values)
plt.title('Correlation Between GDP and Net Sales')
plt.xlabel('Net Sales (million of MXN)')
plt.ylabel('GDP (million of MXN)')
plt.show()
A positive correlation was found between net sales and GDP.
Finally, a new DataFrame with the exchange, bonds, and interest rates data was built to explore their relationships.
rates_df = exchange_rates_df.copy()
rates_df = rates_df.iloc[11:].reset_index(drop=True).rename(columns={'Unnamed: 1':'exchange_rate'})
rates_df = pd.merge(rates_df,
bonds_rates_df.iloc[11:].rename(columns={'Unnamed: 1':'bonds_rate'}),
how='left',
on='Banco de México')
rates_df = pd.merge(rates_df,
interest_rates_df.iloc[11:].rename(columns={'Unnamed: 1':'interest_rate'}),
how='left',
on='Banco de México')
rates_df = rates_df.rename(columns={'Banco de México':'date'})
rates_df.head(10)
| | date | exchange_rate | bonds_rate | interest_rate |
|---|---|---|---|---|
| 0 | 01/01/2014 | 13.0652 | NaN | NaN |
| 1 | 01/02/2014 | 13.0652 | 3.16 | 3.795 |
| 2 | 01/03/2014 | 13.0843 | NaN | 3.795 |
| 3 | 01/04/2014 | 13.1011 | NaN | NaN |
| 4 | 01/05/2014 | 13.1011 | NaN | NaN |
| 5 | 01/06/2014 | 13.1011 | NaN | 3.7925 |
| 6 | 01/07/2014 | 13.0911 | NaN | 3.7962 |
| 7 | 01/08/2014 | 13.0905 | NaN | 3.7994 |
| 8 | 01/09/2014 | 13.0337 | 3.17 | 3.8005 |
| 9 | 01/10/2014 | 13.0846 | NaN | 3.7927 |
sns.pairplot(rates_df.drop(columns=['date']).astype(float))
As expected, a strong positive correlation was found between the bonds rates and the money market interest rates.
On the other hand, no relationship was observed between the exchange rates and either the bonds or the interest rates.
After the data was explored, it was wrangled to build the appropriate dataset for modeling based on the purposes of the present study, and the quality of the data.
In this context, the following transformations were performed:
The sales dataset contains quarterly net sales data and the number of units, as well as the commercial area in $m^{2}$ in both Mexico and Central America.
Net sales corresponds to the response variable in the model, so it was selected.
While the information for net sales and number of units is complete for the selected time period from 1 Feb 2014 to 1 Feb 2024, it was found that the attributes for commercial area in $m^{2}$ in both Mexico and Central America have 32.5% missing values. Thus, it was decided to select only the number of units from the sales dataset as a predictor in the model.
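The 32.5% figure corresponds to 13 missing quarters out of 40. A minimal sketch of the check, on a mock frame mirroring the column layout (not the actual data):

```python
import numpy as np
import pandas as pd

# Mock sales frame: 40 quarterly rows, with the first 13 commercial-area
# observations missing, as in the real dataset.
mock_sales = pd.DataFrame({
    'Net Sales (mdp)': np.arange(40.0),
    'Units': np.arange(40.0),
    'Sq.mt. Mexico': [np.nan] * 13 + list(np.arange(27.0)),
})

# Percentage of missing values per column.
missing_pct = mock_sales.isna().mean() * 100
```

Running `sales_df.isna().mean() * 100` on the real frame yields the per-column missing percentages directly.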
sales_df = sales_df.drop(columns=['Sq.mt. Mexico', 'Sq.mt. Central America'])
On the other hand, the stocks dataset contains information regarding the stock values of WALMEX, the S&P500, and the IPC.
From the initial data analysis, a strong positive relationship was found between the stock value of WALMEX and the S&P500, whereas the relationship between the stock value of WALMEX and the IPC was weaker.
However, as WALMEX operates mostly within the Mexican market, it was decided to select not only the S&P500 but also the IPC as predictors.
From the GDP dataset, the attributes 'TIME_PERIOD', 'OBS_VALUE', 'OBS_EXCEPTION', 'OBS_STATUS', 'OBS_SOURCE', 'OBS_NOTE', and 'COBER_GEO' were present, where 'OBS_VALUE' is the quarterly estimate of the GDP in Mexico. As a positive correlation was found between the net sales of WALMEX and the GDP in the initial data analysis, 'OBS_VALUE' was selected as a predictor in the model, along with the temporality attribute 'TIME_PERIOD'.
gdp_df = gdp_df[['TIME_PERIOD', 'OBS_VALUE']]
Then, from the exchange rates dataset, no meaningful correlation could be established between the historical exchange rates and the net sales of WALMEX. However, for the sake of completeness, the exchange rates were selected as a predictor in the model.
Lastly, regarding the bonds rates and the interest rates datasets, it was found that the bonds rates and the money market interest rate were heavily correlated. So, to avoid multicollinearity issues within the model, only one variable from those two was selected.
Taking into account that the money market interest rates are more relevant to the proper functioning of businesses in the country in terms of access to capital from the financial system, and that the bonds rates dataset exhibited a higher percentage of missing values, it was decided to select only the interest rates as a predictor in the model.
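The multicollinearity concern can be made concrete with the variance inflation factor (VIF), computed here from first principles on synthetic series that mimic the scales of the rates data (these are not the BANXICO figures):

```python
import numpy as np

def vif(X):
    """VIF for each column: regress column j on the other columns and
    return 1 / (1 - R_j^2). Values above ~10 flag redundant predictors."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ beta).var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
bonds = rng.normal(6.2, 2.6, 500)             # bonds-rate stand-in
interest = bonds + rng.normal(0.0, 0.3, 500)  # heavily correlated pair
exchange = rng.normal(18.5, 2.4, 500)         # unrelated series
vifs = vif(np.column_stack([bonds, interest, exchange]))
```

Dropping one of the two rate series brings the remaining VIFs close to 1, which matches the decision to keep only the interest rates.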
For the sales dataset, the attributes were renamed as follows:
sales_df = sales_df.rename(columns={'Year':'year',
'Quarter':'quarter',
'Net Sales (mdp)': 'net_sales',
'Units': 'units'})
For the stock dataset, the attributes were renamed as follows:
stocks_df = stocks_df.rename(columns={"WALMEX.MX": "walmex", "^MXX": "ipc", "^GSPC": "sp500"})
For the GDP dataset, the attributes were renamed as follows:
gdp_df = gdp_df.rename(columns={'TIME_PERIOD': 'year_quarter', 'OBS_VALUE': 'gdp'})
For the exchange rates dataset, the attributes were renamed as follows:
exchange_rates_df = exchange_rates_df.rename(columns={'Banco de México': 'date', 'Unnamed: 1': 'exchange_rates'})
For the interest rates dataset, the attributes were renamed as follows:
interest_rates_df = interest_rates_df.rename(columns={'Banco de México': 'date', 'Unnamed: 1': 'interest_rates'})
In this section, the data types of the different attributes from the datasets were adjusted.
sales_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 40 entries, 0 to 39 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 40 non-null int64 1 quarter 40 non-null object 2 net_sales 40 non-null int64 3 units 40 non-null int64 dtypes: int64(3), object(1) memory usage: 1.4+ KB
The attribute year in the sales dataset was transformed to a string data type, as it will later be concatenated with the quarter attribute. Moreover, no mathematical aggregations are applicable to year within the present study.
sales_df.year = sales_df.year.astype(str)
stocks_df.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 2584 entries, 2014-02-03 to 2024-01-31 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 walmex 2515 non-null float64 1 sp500 2516 non-null float64 2 ipc 2514 non-null float64 dtypes: float64(3) memory usage: 80.8 KB
The attributes in the stocks dataset already have the proper data types, so no changes were required.
gdp_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 176 entries, 0 to 175 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year_quarter 176 non-null object 1 gdp 176 non-null object dtypes: object(2) memory usage: 2.9+ KB
The attribute gdp in the GDP dataset was transformed to a float data type.
gdp_df.gdp = gdp_df.gdp.astype('float64')
exchange_rates_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3728 entries, 0 to 3727 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 3723 non-null object 1 exchange_rates 3720 non-null object dtypes: object(2) memory usage: 58.4+ KB
The attributes date and exchange_rates from the exchange rates dataset were transformed to date and float data types, respectively. To do so, all the non-numeric values in the attributes were first replaced by NaN values to avoid casting errors.
exchange_rates_df.date = pd.to_datetime(exchange_rates_df.date, errors='coerce').dt.date
exchange_rates_df.exchange_rates = exchange_rates_df.exchange_rates.replace({'[^0-9.]': np.nan}, regex=True).astype('float64')
interest_rates_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2570 entries, 0 to 2569 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 2565 non-null object 1 interest_rates 2562 non-null object dtypes: object(2) memory usage: 40.3+ KB
Likewise, the attributes date and interest_rates from the interest rates dataset were transformed to date and float data types, respectively, with the non-numeric values replaced by NaN values.
interest_rates_df.date = pd.to_datetime(interest_rates_df.date, errors='coerce').dt.date
interest_rates_df.interest_rates = interest_rates_df.interest_rates.replace({'[^0-9.]': np.nan}, regex=True).astype('float64')
In the different datasets, the missing values were dropped, as the data will be aggregated on a quarterly basis in a later step.
sales_df = sales_df.dropna()
stocks_df = stocks_df.dropna()
gdp_df = gdp_df.dropna()
exchange_rates_df = exchange_rates_df.dropna()
interest_rates_df = interest_rates_df.dropna()
The datasets have different reporting periods:
In this context, it was decided to add a new derived attribute year_quarter to the stocks, exchange rates, and interest rates datasets in order to match them with the quarterly-reported datasets.
# Adding a year_quarter attribute to the stocks dataset
stocks_df['year'] = stocks_df.index.year
stocks_df['year_quarter'] = pd.PeriodIndex(stocks_df.index, freq='Q')
stocks_df.head()
| Ticker | walmex | sp500 | ipc | year | year_quarter |
|---|---|---|---|---|---|
| Date | |||||
| 2014-02-04 | 31.500000 | 1755.199951 | 40085.519531 | 2014 | 2014Q1 |
| 2014-02-05 | 31.379999 | 1751.640015 | 39880.871094 | 2014 | 2014Q1 |
| 2014-02-06 | 31.190001 | 1773.430054 | 40288.781250 | 2014 | 2014Q1 |
| 2014-02-07 | 29.850000 | 1797.020020 | 40525.738281 | 2014 | 2014Q1 |
| 2014-02-10 | 29.740000 | 1799.839966 | 40116.359375 | 2014 | 2014Q1 |
# Adding a year_quarter attribute to the exchange rates dataset
exchange_rates_df.date = pd.to_datetime(exchange_rates_df.date)
exchange_rates_df = exchange_rates_df.set_index('date')
exchange_rates_df['year'] = exchange_rates_df.index.year
exchange_rates_df['year_quarter'] = pd.PeriodIndex(exchange_rates_df.index, freq='Q')
# Adding a year_quarter attribute to the interest rates dataset
interest_rates_df.date = pd.to_datetime(interest_rates_df.date)
interest_rates_df = interest_rates_df.set_index('date')
interest_rates_df['year'] = interest_rates_df.index.year
interest_rates_df['year_quarter'] = pd.PeriodIndex(interest_rates_df.index, freq='Q')
Moreover, the nomenclature of the year_quarter attribute in the GDP dataset was harmonized with the format produced by the pandas PeriodIndex method.
gdp_df.head()
| year_quarter | gdp | |
|---|---|---|
| 0 | 1980/01 | 1.040137e+07 |
| 1 | 1980/02 | 1.034235e+07 |
| 2 | 1980/03 | 1.039273e+07 |
| 3 | 1980/04 | 1.092767e+07 |
| 4 | 1981/01 | 1.134585e+07 |
gdp_df.year_quarter = gdp_df.year_quarter.apply(lambda x: x.replace('/0', 'Q'))
gdp_df.head()
| year_quarter | gdp | |
|---|---|---|
| 0 | 1980Q1 | 1.040137e+07 |
| 1 | 1980Q2 | 1.034235e+07 |
| 2 | 1980Q3 | 1.039273e+07 |
| 3 | 1980Q4 | 1.092767e+07 |
| 4 | 1981Q1 | 1.134585e+07 |
Finally, a new attribute year_quarter was introduced in the sales dataset:
sales_df['year_quarter'] = sales_df.year + sales_df.quarter
sales_df.head()
| | year | quarter | net_sales | units | year_quarter |
|---|---|---|---|---|---|
| 0 | 2014 | Q1 | 101405 | 2867 | 2014Q1 |
| 1 | 2014 | Q2 | 103300 | 2879 | 2014Q2 |
| 2 | 2014 | Q3 | 104367 | 2904 | 2014Q3 |
| 3 | 2014 | Q4 | 128586 | 2983 | 2014Q4 |
| 4 | 2015 | Q1 | 110875 | 2987 | 2015Q1 |
The sales and GDP datasets are reported on a quarterly basis. Thus, it was decided to aggregate the stocks, exchange rates, and interest rates datasets on a quarterly basis to harmonize the reporting period across all datasets before joining them in the next step.
The mean was used as the aggregation function for the three datasets.
# Aggregating stocks dataset on a quarterly basis
stocks_df = (stocks_df.groupby(by='year_quarter', as_index=False).mean()
.drop(columns='year').reset_index(drop=True))
stocks_df.head()
| Ticker | year_quarter | walmex | sp500 | ipc |
|---|---|---|---|---|
| 0 | 2014Q1 | 29.680789 | 1843.603426 | 39528.067331 |
| 1 | 2014Q2 | 33.320492 | 1901.228686 | 41669.394467 |
| 2 | 2014Q3 | 34.399206 | 1975.541901 | 44788.734003 |
| 3 | 2014Q4 | 30.782623 | 2007.633117 | 43557.017354 |
| 4 | 2015Q1 | 33.806780 | 2064.119323 | 42891.938957 |
# Aggregating exchange rates dataset on a quarterly basis
exchange_rates_df = (exchange_rates_df.groupby(by='year_quarter', as_index=False).mean()
.drop(columns='year').reset_index(drop=True))
exchange_rates_df.head()
| | year_quarter | exchange_rates |
|---|---|---|
| 0 | 2014Q1 | 13.233682 |
| 1 | 2014Q2 | 13.002503 |
| 2 | 2014Q3 | 13.110359 |
| 3 | 2014Q4 | 13.841998 |
| 4 | 2015Q1 | 14.930269 |
# Aggregating interest rates dataset on a quarterly basis
interest_rates_df = (interest_rates_df.groupby(by='year_quarter', as_index=False).mean()
.drop(columns='year').reset_index(drop=True))
interest_rates_df.head()
| | year_quarter | interest_rates |
|---|---|---|
| 0 | 2014Q1 | 3.789890 |
| 1 | 2014Q2 | 3.677145 |
| 2 | 2014Q3 | 3.295198 |
| 3 | 2014Q4 | 3.293132 |
| 4 | 2015Q1 | 3.300654 |
Finally, all the datasets were joined into a single dataset using the year_quarter attribute.
# Adjusting Datatypes on primary keys
stocks_df.year_quarter = stocks_df.year_quarter.astype(str)
exchange_rates_df.year_quarter = exchange_rates_df.year_quarter.astype(str)
interest_rates_df.year_quarter = interest_rates_df.year_quarter.astype(str)
# Merging datasets
df = (sales_df.merge(gdp_df, how='inner', on='year_quarter')
.merge(stocks_df, how='inner', on='year_quarter')
.merge(exchange_rates_df, how='inner', on='year_quarter')
.merge(interest_rates_df, how='inner', on='year_quarter')
)
df.head()
| | year | quarter | net_sales | units | year_quarter | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | Q1 | 101405 | 2867 | 2014Q1 | 2.154765e+07 | 29.680789 | 1843.603426 | 39528.067331 | 13.233682 | 3.789890 |
| 1 | 2014 | Q2 | 103300 | 2879 | 2014Q2 | 2.232009e+07 | 33.320492 | 1901.228686 | 41669.394467 | 13.002503 | 3.677145 |
| 2 | 2014 | Q3 | 104367 | 2904 | 2014Q3 | 2.225821e+07 | 34.399206 | 1975.541901 | 44788.734003 | 13.110359 | 3.295198 |
| 3 | 2014 | Q4 | 128586 | 2983 | 2014Q4 | 2.293982e+07 | 30.782623 | 2007.633117 | 43557.017354 | 13.841998 | 3.293132 |
| 4 | 2015 | Q1 | 110875 | 2987 | 2015Q1 | 2.217008e+07 | 33.806780 | 2064.119323 | 42891.938957 | 14.930269 | 3.300654 |
# Drop year and quarter columns
df = df.drop(columns=['year', 'quarter']).set_index('year_quarter')
df.head()
| | net_sales | units | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates |
|---|---|---|---|---|---|---|---|---|
| year_quarter | ||||||||
| 2014Q1 | 101405 | 2867 | 2.154765e+07 | 29.680789 | 1843.603426 | 39528.067331 | 13.233682 | 3.789890 |
| 2014Q2 | 103300 | 2879 | 2.232009e+07 | 33.320492 | 1901.228686 | 41669.394467 | 13.002503 | 3.677145 |
| 2014Q3 | 104367 | 2904 | 2.225821e+07 | 34.399206 | 1975.541901 | 44788.734003 | 13.110359 | 3.295198 |
| 2014Q4 | 128586 | 2983 | 2.293982e+07 | 30.782623 | 2007.633117 | 43557.017354 | 13.841998 | 3.293132 |
| 2015Q1 | 110875 | 2987 | 2.217008e+07 | 33.806780 | 2064.119323 | 42891.938957 | 14.930269 | 3.300654 |
df.tail()
| | net_sales | units | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates |
|---|---|---|---|---|---|---|---|---|
| year_quarter | ||||||||
| 2022Q4 | 236272 | 3745 | 2.498115e+07 | 73.022500 | 3849.569010 | 49374.139648 | 19.699096 | 10.029444 |
| 2023Q1 | 204601 | 3755 | 2.429196e+07 | 72.581333 | 3999.022510 | 53196.447331 | 18.704144 | 11.071640 |
| 2023Q2 | 212164 | 3775 | 2.500194e+07 | 70.336667 | 4208.393827 | 54292.645443 | 17.723184 | 11.521205 |
| 2023Q3 | 211436 | 3802 | 2.512127e+07 | 67.776825 | 4458.137447 | 53210.645523 | 17.058150 | 11.501251 |
| 2023Q4 | 251921 | 3903 | 2.559636e+07 | 66.062000 | 4463.005339 | 52375.833008 | 17.582583 | 11.504218 |
Finally, the index of the integrated dataset was converted to a date data type to ease the application of the time series modeling below.
df.index = pd.to_datetime(df.index)
df = df.rename_axis('date')
Now, the dataset is ready for the Exploratory Data Analysis.
# Exporting processed dataset
df.to_csv('dataset_processed.csv', index=True)
The dataset for the time series analysis was split into predictors ($X$) and the response variable ($y$) as follows, to ease the subsequent statistical tests before modeling:
X = df[df.columns.difference(['net_sales'], sort=False)]
y = df['net_sales']
X.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 40 entries, 2014-01-01 to 2023-10-01 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 units 40 non-null int64 1 gdp 40 non-null float64 2 walmex 40 non-null float64 3 sp500 40 non-null float64 4 ipc 40 non-null float64 5 exchange_rates 40 non-null float64 6 interest_rates 40 non-null float64 dtypes: float64(6), int64(1) memory usage: 2.5 KB
X_vals = X.values
y_vals = y.values
In this section, the data was explored to calculate simple statistical measures, identify time trends, and explore distributions and relationships in the unified dataset.
df.describe()
| | net_sales | units | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates |
|---|---|---|---|---|---|---|---|---|
| count | 40.000000 | 40.000000 | 4.000000e+01 | 40.000000 | 40.000000 | 40.000000 | 40.000000 | 40.000000 |
| mean | 160831.950000 | 3311.700000 | 2.349702e+07 | 52.416249 | 3006.016493 | 46309.976007 | 18.490338 | 6.431333 |
| std | 37308.578943 | 300.161512 | 1.136271e+06 | 13.595784 | 905.561387 | 4472.673823 | 2.382935 | 2.550068 |
| min | 101405.000000 | 2867.000000 | 1.920180e+07 | 29.680789 | 1843.603426 | 36401.019723 | 13.002503 | 3.293132 |
| 25% | 130929.000000 | 3041.500000 | 2.299367e+07 | 42.609978 | 2147.388925 | 43350.607976 | 17.688033 | 4.242145 |
| 50% | 160334.500000 | 3254.000000 | 2.347919e+07 | 52.997484 | 2793.243389 | 46092.678673 | 19.051471 | 6.221824 |
| 75% | 188321.250000 | 3545.750000 | 2.422661e+07 | 64.671428 | 3894.647899 | 49392.656242 | 20.017630 | 8.143081 |
| max | 251921.000000 | 3903.000000 | 2.559636e+07 | 74.324166 | 4600.347089 | 54292.645443 | 23.365513 | 11.521205 |
A high-level view of the time series for all the features is shown below:
# List of indexes
col_list = range(len(df.columns))
# Adjusting figure size
plt.figure(figsize=(8,7))
# Loop
i = 1
for col in col_list:
# Plot
ax = plt.subplot(len(col_list), 1, i)
plt.plot(df.iloc[:,col])
plt.title(df.columns[col], y=0.5, loc='left')
plt.grid(False)
# Removing labels and ticks
plt.ylabel('')
plt.xlabel('')
plt.xticks([])
i += 1
# Adding an x label
plt.xlabel('Time')
plt.savefig(f'Images/fig_{"variables over time".lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
However, individual line charts were plotted to assess in more detail the time trends and seasonality of each feature:
plot_linechart(df)
From the plots above, it can be seen that all the variables exhibited a positive trend over time.
sns.pairplot(df)
plt.savefig(f'Images/fig_{"variables distributions and relationships".lower().replace(" ", "_")}.png', bbox_inches = 'tight')
From the plot above, none of the predictors follow a normal distribution.
Moreover, it can be seen that units, the S&P500, the WALMEX stock, and net sales have a strong positive relationship, whereas the GDP and the interest rates show a moderate positive relationship.
df.corr()
| | net_sales | units | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates |
|---|---|---|---|---|---|---|---|---|
| net_sales | 1.000000 | 0.941812 | 0.591655 | 0.881404 | 0.876755 | 0.497426 | 0.545840 | 0.706790 |
| units | 0.941812 | 1.000000 | 0.471083 | 0.951423 | 0.950654 | 0.481692 | 0.539032 | 0.665831 |
| gdp | 0.591655 | 0.471083 | 1.000000 | 0.458073 | 0.435537 | 0.668256 | 0.103462 | 0.690051 |
| walmex | 0.881404 | 0.951423 | 0.458073 | 1.000000 | 0.949792 | 0.539263 | 0.613590 | 0.597886 |
| sp500 | 0.876755 | 0.950654 | 0.435537 | 0.949792 | 1.000000 | 0.575764 | 0.524288 | 0.536097 |
| ipc | 0.497426 | 0.481692 | 0.668256 | 0.539263 | 0.575764 | 1.000000 | 0.070975 | 0.505446 |
| exchange_rates | 0.545840 | 0.539032 | 0.103462 | 0.613590 | 0.524288 | 0.070975 | 1.000000 | 0.317116 |
| interest_rates | 0.706790 | 0.665831 | 0.690051 | 0.597886 | 0.536097 | 0.505446 | 0.317116 | 1.000000 |
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.corr(), cmap=cmap, center=0.5, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.title('Correlation Matrix')
plt.xlabel('Attributes')
plt.ylabel('Attributes')
plt.savefig(f'Images/fig_{"correlation matrix".lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Indeed, from the table above, it can be seen that the correlations among net sales, units, the WALMEX stock, and the S&P500 are strong, whereas the GDP, the interest rates, and the IPC are only moderately correlated with net sales.
In contrast, the exchange rates are weakly correlated with the GDP and the IPC.
As strong correlations among the attributes in the dataset were found, it was decided to calculate the differences between consecutive observations for each variable.
df.diff()
| | net_sales | units | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates |
|---|---|---|---|---|---|---|---|---|
| date | ||||||||
| 2014-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2014-04-01 | 1895.0 | 12.0 | 772439.473 | 3.639703 | 57.625260 | 2141.327136 | -0.231179 | -0.112745 |
| 2014-07-01 | 1067.0 | 25.0 | -61883.856 | 1.078714 | 74.313215 | 3119.339536 | 0.107855 | -0.381947 |
| 2014-10-01 | 24219.0 | 79.0 | 681607.627 | -3.616583 | 32.091216 | -1231.716649 | 0.731639 | -0.002067 |
| 2015-01-01 | -17711.0 | 4.0 | -769739.789 | 3.024157 | 56.486206 | -665.078397 | 1.088271 | 0.007522 |
| 2015-04-01 | 2675.0 | 6.0 | 673107.361 | 4.392073 | 38.175439 | 2099.669970 | 0.379645 | 0.000254 |
| 2015-07-01 | 2908.0 | 11.0 | 168436.267 | 1.374798 | -74.588570 | -1127.769393 | 1.091249 | 0.016643 |
| 2015-10-01 | 28523.0 | 61.0 | 436115.112 | 4.159252 | 23.757037 | 16.158072 | 0.348193 | 0.029321 |
| 2016-01-01 | -19467.0 | -9.0 | -911954.852 | -0.622731 | -103.225304 | -638.493498 | 1.266760 | 0.443094 |
| 2016-04-01 | 1094.0 | 15.0 | 823087.967 | -0.458297 | 126.893481 | 2230.406048 | 0.035913 | 0.290438 |
| 2016-07-01 | -733.0 | -94.0 | -144280.433 | 0.836220 | 87.288906 | 1679.088108 | 0.666134 | 0.506849 |
| 2016-10-01 | 30640.0 | 45.0 | 770137.090 | -4.150428 | 22.484175 | -668.554319 | 1.108795 | 0.842175 |
| 2017-01-01 | -24805.0 | 6.0 | -674041.839 | -0.635500 | 140.787174 | 856.216927 | 0.562376 | 0.966881 |
| 2017-04-01 | 3082.0 | 18.0 | 403515.674 | 4.053571 | 73.568811 | 1904.803587 | -1.790742 | 0.633231 |
| 2017-07-01 | 11843.0 | 159.0 | -222953.221 | -0.271452 | 67.914152 | 1606.913207 | -0.776203 | 0.343346 |
| 2017-10-01 | 20480.0 | -71.0 | 829054.680 | 1.899047 | 135.831554 | -2131.552536 | 1.111891 | 0.045699 |
| 2018-01-01 | -23162.0 | 22.0 | -768826.439 | 1.743736 | 133.879166 | 50.576675 | -0.170510 | 0.326388 |
| 2018-04-01 | 1523.0 | 26.0 | 830388.254 | 4.231661 | -32.986940 | -1909.600156 | 0.608665 | 0.127860 |
| 2018-07-01 | 1159.0 | 23.0 | -243029.643 | 3.691905 | 145.703024 | 2302.189422 | -0.394126 | 0.235060 |
| 2018-10-01 | 29488.0 | 44.0 | 495679.257 | -2.106301 | -150.142597 | -5061.005004 | 0.850446 | 0.140761 |
| 2019-01-01 | -25589.0 | 10.0 | -820271.023 | -1.607554 | 19.582852 | -1046.752090 | -0.607683 | 0.310076 |
| 2019-04-01 | 2615.0 | 32.0 | 398092.413 | 4.108466 | 162.235186 | 750.026780 | -0.096438 | -0.054586 |
| 2019-07-01 | 11005.0 | 161.0 | -158157.609 | 1.407294 | 75.922653 | -2129.911552 | 0.292759 | -0.150171 |
| 2019-10-01 | 30570.0 | 37.0 | 328696.864 | 0.041976 | 123.538524 | 1712.626726 | -0.135783 | -0.491249 |
| 2020-01-01 | -24767.0 | -73.0 | -938115.399 | 0.184734 | -16.924056 | -1333.406271 | 0.597199 | -0.520779 |
| 2020-04-01 | -2082.0 | 12.0 | -4242953.706 | 0.196816 | -128.147437 | -5652.549938 | 3.486704 | -1.294603 |
| 2020-07-01 | -3721.0 | 24.0 | 2828564.567 | -1.915431 | 383.125359 | 887.166351 | -1.254566 | -1.066331 |
| 2020-10-01 | 30570.0 | 37.0 | 1436766.856 | -0.226280 | 238.480987 | 3408.909188 | -1.479114 | -0.475092 |
| 2021-01-01 | -25390.0 | 24.0 | -731247.989 | 8.026662 | 308.241010 | 5005.818138 | -0.311898 | -0.137137 |
| 2021-04-01 | 4044.0 | 25.0 | 662532.377 | 2.046887 | 320.753682 | 3745.292623 | -0.270193 | -0.072722 |
| 2021-07-01 | 1199.0 | 31.0 | -236375.579 | 4.341144 | 234.492389 | 1653.000760 | -0.040312 | 0.339241 |
| 2021-10-01 | 38486.0 | 51.0 | 744441.461 | 4.439178 | 179.587107 | 436.072010 | 0.736949 | 0.554809 |
| 2022-01-01 | -26654.0 | 11.0 | -450075.199 | 1.335941 | -136.780602 | 1562.314241 | -0.222159 | 0.842042 |
| 2022-04-01 | 7649.0 | 20.0 | 720026.474 | -1.427117 | -362.603037 | -1997.190050 | -0.481987 | 1.039025 |
| 2022-07-01 | 2263.0 | 26.0 | 92985.832 | -1.422605 | -118.913592 | -4065.328641 | 0.200000 | 1.404996 |
| 2022-10-01 | 39951.0 | 68.0 | 711724.075 | 1.548056 | -132.480849 | 2337.065306 | -0.543136 | 1.553271 |
| 2023-01-01 | -31671.0 | 10.0 | -689190.328 | -0.441167 | 149.453499 | 3822.307682 | -0.994951 | 1.042196 |
| 2023-04-01 | 7563.0 | 20.0 | 709983.829 | -2.244667 | 209.371318 | 1096.198112 | -0.980961 | 0.449565 |
| 2023-07-01 | -728.0 | 27.0 | 119329.291 | -2.559842 | 249.743620 | -1081.999919 | -0.665034 | -0.019954 |
| 2023-10-01 | 40485.0 | 101.0 | 475091.294 | -1.714826 | 4.867891 | -834.812516 | 0.524433 | 0.002967 |
sns.pairplot(df.diff())
The differences did not exhibit any clear relationship among them, and their distributions were closer to normal. To confirm these observations, Pearson's correlation coefficients were calculated for the differences.
df.diff().corr()
| net_sales | units | gdp | walmex | sp500 | ipc | exchange_rates | interest_rates | |
|---|---|---|---|---|---|---|---|---|
| net_sales | 1.000000 | 0.438193 | 0.471910 | -0.143753 | -0.086474 | -0.143666 | 0.057721 | -0.013358 |
| units | 0.438193 | 1.000000 | 0.138183 | -0.094973 | -0.006673 | -0.018928 | -0.177507 | 0.028760 |
| gdp | 0.471910 | 0.138183 | 1.000000 | -0.087999 | 0.281858 | 0.279658 | -0.543744 | 0.105603 |
| walmex | -0.143753 | -0.094973 | -0.087999 | 1.000000 | 0.224537 | 0.394898 | -0.035092 | -0.092308 |
| sp500 | -0.086474 | -0.006673 | 0.281858 | 0.224537 | 1.000000 | 0.572800 | -0.399108 | -0.347191 |
| ipc | -0.143666 | -0.018928 | 0.279658 | 0.394898 | 0.572800 | 1.000000 | -0.615247 | 0.090324 |
| exchange_rates | 0.057721 | -0.177507 | -0.543744 | -0.035092 | -0.399108 | -0.615247 | 1.000000 | -0.236574 |
| interest_rates | -0.013358 | 0.028760 | 0.105603 | -0.092308 | -0.347191 | 0.090324 | -0.236574 | 1.000000 |
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.diff().corr(), cmap=cmap, center=0, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.title('Correlation Matrix')
plt.xlabel('Attributes')
plt.ylabel('Attributes')
Indeed, according to the table above, only weak-to-moderate correlations were found among the differences.
In this context, it was decided not to include the differences in the dataset for the regression model.
# Concatenating the differences attributes to the dataset
# df_regression = df.copy()
# df_regression = pd.concat([df_regression, df_regression.diff().add_suffix('_diff')], axis=1).dropna()
# Splitting the predictors and response variable for the dataset including differences
# X_reg = df_regression[df_regression.columns.difference(['net_sales', 'net_sales_diff'], sort=False)]
# y_reg = df_regression['net_sales']
The stationarity of the net sales time series was assessed by testing for unit roots with the Augmented Dickey-Fuller (ADF) test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: Non\text{-}stationary$$$$H_{1}: Stationary$$$$\alpha = 0.05$$
stationarity_test = adfuller(y_vals, autolag='AIC')
print("ADF statistic: ", stationarity_test[0])
print("P-value: ", stationarity_test[1])
ADF statistic:  1.3627370980379012
P-value:  0.9969381097148852
As the $p\text{-}value > 0.05$ then the null hypothesis cannot be rejected. So, the time series is not stationary.
In this context, the net sales data was transformed using differencing to stabilize the trend and seasonality (Peixeiro, 2022):
# First order differencing
y_vals_diff = np.diff(y_vals, n=1)
y_diff = pd.DataFrame(y_vals_diff, columns=['Net Sales'], index=y.index[1:])
chart_title = 'First-Order Differenced Net Sales'
plot_linechart(y_diff, save=False)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Indeed, the differencing appears to have stabilized the trend of the net sales. So, the ADF test was applied to the first-order differenced net sales:
stationarity_test_diff = adfuller(y_vals_diff, autolag='AIC')
print("ADF statistic: ", stationarity_test_diff[0])
print("P-value: ", stationarity_test_diff[1])
ADF statistic:  -1.6511761047174953
P-value:  0.45642712357493526
Even though the ADF statistic is now negative, the $p\text{-}value > 0.05$, so the null hypothesis cannot be rejected: the time series is still not stationary after the first-order differencing.
Thus, the net sales were transformed again with a second-order differencing, and the ADF test was applied once more.
# Second-order differencing
y_vals_diff2 = np.diff(y_vals_diff, n=1)
y_diff2 = pd.DataFrame(y_vals_diff2, columns=['Net Sales'], index=y.index[2:])
chart_title = 'Second-Order Differenced Net Sales'
plot_linechart(y_diff2, save=False)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
stationarity_test_diff2 = adfuller(y_vals_diff2, autolag='AIC')
print("ADF statistic: ", stationarity_test_diff2[0])
print("P-value: ", stationarity_test_diff2[1])
ADF statistic:  -4.941259634510584
P-value:  2.89677222031049e-05
Now, the ADF statistic is strongly negative and the $p\text{-}value < 0.05$, so the null hypothesis is rejected. The second-order differenced time series is stationary.
Moreover, as the VAR model requires all the time series to be stationary as well (Peixeiro, 2022), the stationarity of the other macroeconomic variables was also tested with the Augmented Dickey-Fuller (ADF) test:
$$H_{0}: Non\text{-}stationary$$$$H_{1}: Stationary$$$$\alpha = 0.05$$
def test_stationarity(df, alpha):
"""
Tests for stationarity by applying the Augmented Dickey-Fuller (ADF) test to each of the columns
in the input dataset, at the provided significance level.
Parameters:
df (pandas.dataframe): Dataset with the variables to be tested for stationarity.
alpha (float): Specified significance level.
Returns:
None
"""
for col in df.columns:
stationarity_test = adfuller(df[col], autolag='AIC')
print(f'{col}:')
print(f"ADF statistic: {stationarity_test[0]:.03f}")
print(f"P-value: {stationarity_test[1]:.03f}")
if stationarity_test[1] <= alpha:
print("The series is stationary.\n")
else:
print("The series is not stationary.\n")
test_stationarity(X, 0.05)
units:
ADF statistic: 0.403
P-value: 0.982
The series is not stationary.

gdp:
ADF statistic: -3.197
P-value: 0.020
The series is stationary.

walmex:
ADF statistic: -1.143
P-value: 0.697
The series is not stationary.

sp500:
ADF statistic: -0.191
P-value: 0.940
The series is not stationary.

ipc:
ADF statistic: -2.044
P-value: 0.268
The series is not stationary.

exchange_rates:
ADF statistic: -2.308
P-value: 0.169
The series is not stationary.

interest_rates:
ADF statistic: -3.185
P-value: 0.021
The series is stationary.
Thus, only the gdp and interest_rates series were stationary, while the others were not. For simplicity, a first-order differencing transformation was applied to all the series, which also simplifies reversing the transformation when making predictions with the VAR model.
# First-order differencing
X_diff = X.diff().dropna()
test_stationarity(X_diff, 0.05)
units:
ADF statistic: -5.445
P-value: 0.000
The series is stationary.

gdp:
ADF statistic: -6.037
P-value: 0.000
The series is stationary.

walmex:
ADF statistic: -4.932
P-value: 0.000
The series is stationary.

sp500:
ADF statistic: -4.322
P-value: 0.000
The series is stationary.

ipc:
ADF statistic: -5.091
P-value: 0.000
The series is stationary.

exchange_rates:
ADF statistic: -4.562
P-value: 0.000
The series is stationary.

interest_rates:
ADF statistic: -3.253
P-value: 0.017
The series is stationary.
Thus, first-order differencing was enough to render all the macroeconomic indicator series stationary.
The autocorrelation of the net sales time series was assessed by means of the Autocorrelation Function (ACF) plot, to explore the significant coefficients/lags in the series.
chart_title = 'Net Sales Autocorrelation'
plot_acf(y_vals, lags=20)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Thus, according to the ACF plot, the lags 2, 3, 4, and 5 are significant, which means that the time series is actually dependent on its past values.
chart_title = 'First-Order Differenced Net Sales Autocorrelation'
plot_acf(y_vals_diff, lags=20)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
In this case, as there are no consecutive significant lags after lag 2, it is possible to state that the first-order differenced net sales is autocorrelated only at lags 1 and 2.
chart_title = 'Second-Order Differenced Net Sales Autocorrelation'
plot_acf(y_vals_diff2, lags=20)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Likewise, as there are no consecutive significant lags after lag 2, it is possible to state that the second-order differenced net sales is autocorrelated only at lags 1 and 2, too. However, the overall trend of the ACF follows a sinusoidal-like pattern, which is evidence that the time series is autoregressive (Peixeiro, 2022).
As an important conclusion so far: since the ADF test showed that the first-order differenced net sales were not stationary, and the ACF plot of the first-order differenced net sales showed autocorrelation at lag 2, the net sales time series is not a random walk. This means that statistical learning techniques can be successfully applied to forecast the data (Peixeiro, 2022).
As part of the exploratory data analysis, the WALMEX net sales series was decomposed using the seasonal_decompose method from the statsmodels library.
Firstly, an additive decomposition was performed:
chart_title = 'Additive Decomposition for WALMEX Net Sales'
decomposition = seasonal_decompose(y, model='additive')
decomposition.plot()
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the plot above, it is clear that the WALMEX net sales adjust very well to an additive model, as the trend is a straight line with a positive slope, and the seasonality could be isolated cleanly from the series, exhibiting both a regular frequency (with a cycle of 4 quarters) and a regular amplitude.
Then, a multiplicative decomposition was performed as well on the series:
chart_title = 'Multiplicative Decomposition for WALMEX Net Sales'
decomposition = seasonal_decompose(y, model='multiplicative')
decomposition.plot()
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
As the residuals in the multiplicative decomposition show a uniform value, this indicates that the decomposition failed to properly capture all the information from the series. So, this decomposition approach is not reliable.
To check whether the variables shared a common stochastic trend, or more formally, whether there is a linear combination of these variables that is stable (stationary) (Kotzé, 2024); Johansen's Cointegration test was performed (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
The Johansen test checks the cointegration rank $r$ of a given set of vectors. The null hypothesis of $r=0$ implies no cointegration relationship; whereas a rank of $r\leq1$ means at least one cointegration relationship, a rank of $r\leq2$ means at least two cointegration relationships, etc. (Wang, 2018):
$$H_{0}: Non\text{-}cointegration$$$$H_{1}: Cointegration$$$$\alpha = 0.05$$
The test was performed using the most restrictive deterministic term order that assumes a linear trend (det_order=1), and one lag (k_ar_diff=1):
def test_cointegration(df, det_order=1, k_ar_diff=1):
"""
Performs the Johansen's Cointegration test and returns the results.
Parameters:
df (pandas.dataframe): Time series data for the cointegration test
det_order (int): The order of the deterministic terms:
-1: No constant or trend.
0: Constant term only.
1: Constant and trend terms.
k_ar_diff (int): The number of lags.
Returns:
results (pandas.dataframe): Results from the Johansen's Cointegration test.
"""
coint_test_result = coint_johansen(endog=df, det_order=det_order, k_ar_diff=k_ar_diff)
trace_stats = coint_test_result.trace_stat
trace_stats_crit_vals = coint_test_result.trace_stat_crit_vals
rank_list_int = list(range(0, df.shape[1]))
rank_list_str = [str(x) for x in rank_list_int]
rank_list_prefix = ['r = ']
rank_list_prefix = rank_list_prefix + (['r <= ',] * (len(rank_list_str) - 1))
rank_list = []
for i in range(len(rank_list_int)):
rank_list.append(rank_list_prefix[i] + rank_list_str[i])
result_dict = {'Rank':rank_list,
'CL 90%':trace_stats_crit_vals[:,0],
'CL 95%':trace_stats_crit_vals[:,1],
'CL 99%':trace_stats_crit_vals[:,2],
'Trace Statistic': trace_stats}
results = pd.DataFrame(result_dict)
return results
results = test_cointegration(df=X, det_order=1, k_ar_diff=1)
results
| Rank | CL 90% | CL 95% | CL 99% | Trace Statistic | |
|---|---|---|---|---|---|
| 0 | r = 0 | 133.7852 | 139.2780 | 150.0778 | 195.598921 |
| 1 | r <= 1 | 102.4674 | 107.3429 | 116.9829 | 135.936953 |
| 2 | r <= 2 | 75.1027 | 79.3422 | 87.7748 | 86.363668 |
| 3 | r <= 3 | 51.6492 | 55.2459 | 62.5202 | 49.849412 |
| 4 | r <= 4 | 32.0645 | 35.0116 | 41.0815 | 27.757403 |
| 5 | r <= 5 | 16.1619 | 18.3985 | 23.1485 | 13.290473 |
| 6 | r <= 6 | 2.7055 | 3.8415 | 6.6349 | 4.501931 |
Thus, it is possible to easily reject the null hypothesis of $r=0$, as the trace statistic is above the critical values at the 90%, 95%, and 99% confidence levels. However, the test results above suggested that there are up to two cointegration relationships at the 95% confidence level. So, it is not possible to state that all the variables are cointegrated.
To find the two cointegration relationships: from the relationships and correlations sections, it was clear that the units, S&P 500, and WALMEX variables were strongly correlated. So, the cointegration test was performed using only those series.
results = test_cointegration(df=X[['units', 'sp500', 'walmex']], det_order=1, k_ar_diff=1)
results
| Rank | CL 90% | CL 95% | CL 99% | Trace Statistic | |
|---|---|---|---|---|---|
| 0 | r = 0 | 32.0645 | 35.0116 | 41.0815 | 31.360790 |
| 1 | r <= 1 | 16.1619 | 18.3985 | 23.1485 | 14.621989 |
| 2 | r <= 2 | 2.7055 | 3.8415 | 6.6349 | 4.138262 |
From the results above, the trace statistic for $r\leq2$ (4.138) is above the critical value at the 95% confidence level (3.842), supporting the presence of cointegration among these three series.
Therefore, the VAR model was later built with the above-mentioned series.
Firstly, ten univariate time series models were built to predict the net sales of WALMEX:
Then, three vector models were created and trained in Python 3 and its libraries Statsmodels and Darts to predict the values of the selected macroeconomic indicators as a multivariate time series:
After that, two regression models were built using Random Forests and Support Vector Machines in Python 3 and its library Scikit-learn to predict WALMEX total sales based on the predictions for the best performing multivariate time series model.
All models were assessed using the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), and the Coefficient of Determination ($r^{2}$).
For reporting purposes, the model assessment plots were shown in the original scale of the WALMEX net sales data.
Finally, for the sake of clarity, each model is described separately in the present section.
# Function to calculate the RMSE, MAE and Coefficient of Determination
def calculate_scores(predictions, actuals=pd.DataFrame(y[(int(0.8*(len(y)))):], columns=["net_sales"])):
"""
Calculates the RMSE, the MAE and the Coefficient of Determination for a given set of predictions.
Parameters:
predictions (pandas.dataframe): Time series predictions for testing period.
actuals (pandas.dataframe): Time series actual values for testing period.
Returns:
rmse (list): Root Mean Squared Error of the predictions.
mae (list): Mean Absolute Error of the predictions.
coeff_det (list): Coefficient of determination (r2) of the predictions.
"""
rmse = []
mae = []
coeff_det = []
if len(actuals) == len(predictions):
for i in range(0, actuals.shape[1]):
print(f'{actuals.columns[i]}')
rmse_value = np.sqrt(mean_squared_error(actuals.iloc[:,i].values, predictions.iloc[:,i].values))
mae_value = mean_absolute_error(actuals.iloc[:,i].values, predictions.iloc[:,i].values)
coeff_det_value = r2_score(actuals.iloc[:,i].values, predictions.iloc[:,i].values)
print(f'RMSE: {rmse_value:.3f}')
print(f'MAE: {mae_value:.3f}')
print(f'Coefficient of Determination: {coeff_det_value:.3f}\n')
rmse.append(rmse_value)
mae.append(mae_value)
coeff_det.append(coeff_det_value)
return rmse, mae, coeff_det
else:
print('Number of features is different between the testing set and the predictions set.')
# Function to plot the train, test and predictions for a given model
def plot_predictions(y_pred, model_name, y_train=y[:(int(0.8*(len(y)))) + 1], y_test=y[(int(0.8*(len(y)))):]):
"""
Plots the train, test and predictions sets for a given model.
Parameters:
y_train (pandas.dataframe): Training dataset
y_test (pandas.dataframe): Testing dataset
y_pred (pandas.dataframe): Predictions from the model
model_name (str): Name of the model for the plot title.
Returns:
None
"""
chart_title = 'Predictions from ' + model_name + ' vs. WALMEX Historical Net Sales'
# Plots
fig, ax = plt.subplots(figsize=(8,5))
sns.lineplot(y_train, ax=ax, zorder=2)
sns.lineplot(y_test, ax=ax, color='green', linestyle='dashed', zorder=1)
sns.lineplot(y_pred, ax=ax, color = pred_color, lw=2, zorder=3)
# Adding shadow to predictions
first_pred_x = y_test.index[0]
last_pred_x = y_test.index[-1]
ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)
# Adding legend to plot
legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
Line2D([0], [0], color='green', lw=2, linestyle='dashed'),
Line2D([0], [0], color=pred_color, lw=2)]
plt.legend(legend_lines,
['Training', 'Testing', 'Predictions'],
loc='upper left',
facecolor='white',
frameon = True)
# Adding labels
plt.title(chart_title)
plt.ylabel('Net Sales (mdp)')
plt.xlabel('Time')
# Adjusting Y ticks to Currency format
ticks = ax.get_yticks()
new_labels = [f'${int(i):,.0f}' for i in ticks]
ax.set_yticklabels(new_labels)
plt.savefig(f'Images/fig_{chart_title.lower().replace(".", "").replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
# Creating list to store the scores from each univariate time series model
univariate_models_list = ['MA', 'AR', 'AR w/Additive Decomp.', 'ARMA', 'ARIMA', 'SARIMA', 'SARIMAX', 'SES', 'HW', 'Prophet']
rmse_list = []
mae_list = []
r2_list = []
# Creating lists to store the scores from multivariate time series models
multivariate_models_list = ['VAR', 'VARMA', 'VARIMA']
X_rmse_list = []
X_mae_list = []
X_r2_list = []
# Creating lists to store the scores from regression models
regression_models_list = ['RF', 'SVR']
reg_rmse_list = []
reg_mae_list = []
reg_r2_list = []
A Moving Average (MA) model was built as a baseline model. As suggested by the name of this technique, the MA model calculates the moving average of the variable over time using a specified window length (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). This technique has the advantage of removing random fluctuations and smoothing the data.
The MA model is denoted as MA($q$), where $q$ is the number of past error terms that affects the present value of the time series (Peixeiro, 2022).
The model was built using the library pandas in Python.
It was assumed that the current values are linearly dependent on the mean of the series and on the current and past error terms, and that the errors are normally distributed (Peixeiro, 2022).
On the other hand, from the ACF plot in the EDA section, it was assumed that a window of 5 was enough to capture the overall trend of the data, as the ACF plot showed that lag 5 was the last significant coefficient.
The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff + 1]
y_test = time_series[cutoff:]
# Window length
ma_window = 5
# Calculating the moving average
y_pred_ma_model = y.rolling(ma_window).mean()
# Narrowing down the moving average calculations to the y_test time period
y_pred_ma_model = y_pred_ma_model[len(y_train) - 1:]
y_pred_ma_model
date
2022-01-01    183352.2
2022-04-01    188297.0
2022-07-01    192885.6
2022-10-01    205224.6
2023-01-01    203532.2
2023-04-01    208683.2
2023-07-01    212158.8
2023-10-01    223278.8
Name: net_sales, dtype: float64
The predictions were plotted against the historical net sales data to visually assess the performance of the MA model.
plot_predictions(y_pred_ma_model, 'MA Model')
In view of the plot above, the MA model was able to capture the trend of the time series but not its seasonality.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
ma_rmse, ma_mae, ma_r2 = calculate_scores(pd.DataFrame(y_pred_ma_model))
net_sales
RMSE: 15216.901
MAE: 9651.900
Coefficient of Determination: 0.465
# Adding scores to lists
rmse_list.append(ma_rmse[0])
mae_list.append(ma_mae[0])
r2_list.append(ma_r2[0])
As suggested by the ACF plot in the EDA, an Autoregressive (AR) model was built to forecast the net sales of WALMEX.
An autoregressive model implies a regression of a variable against itself, which means that the present value of a given point in a time series is linearly dependent on its past values (Peixeiro, 2022).
An AR model is denoted as AR($p$), where the order $p$ determines the number of past values that affect the present value (Peixeiro, 2022).
The model was built using the statsmodels library in Python.
It was assumed that the current values are linearly dependent on their past values (Peixeiro, 2022). Moreover, it was assumed that the time series was stationary, it was not a random walk, and seasonality effects were not relevant for modeling.
The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
It is important to note that the second-order differenced net sales time series was used for modeling, as AR models require that the time series be stationary.
time_series = y_diff2
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
Then, the order for each AR Model was identified by using the partial autocorrelation function (PACF) plot to assess the effect of past data (the so-called lags) on future data (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
# PACF for the second-order differenced net sales
chart_title = 'Partial Autocorrelation for Second-Order Differenced Net Sales'
pacf = plot_pacf(y_train, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
So, the order of the AR model for the net sales is 3.
ar_model = AutoReg(y_diff2, lags=3).fit()
# Summary for the AR Model
print(ar_model.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: Net Sales No. Observations: 38
Model: AutoReg(3) Log Likelihood -350.623
Method: Conditional MLE S.D. of innovations 5425.504
Date: Thu, 12 Sep 2024 AIC 711.246
Time: 09:48:39 BIC 719.023
Sample: 04-01-2015 HQIC 713.931
- 10-01-2023
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
const 142.5663 918.068 0.155 0.877 -1656.815 1941.947
Net Sales.L1 -1.0230 0.037 -27.670 0.000 -1.095 -0.951
Net Sales.L2 -1.0104 0.047 -21.417 0.000 -1.103 -0.918
Net Sales.L3 -1.0203 0.038 -27.187 0.000 -1.094 -0.947
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -0.9838 -0.0000j 0.9838 -0.5000
AR.2 -0.0033 -0.9981j 0.9981 -0.2505
AR.3 -0.0033 +0.9981j 0.9981 0.2505
-----------------------------------------------------------------------------
# Defining starting and ending points
start_point = len(y_diff2) - len(y_test) - 1
end_point = start_point + len(y_test)
# Getting predictions for each component
y_pred_ar_model = ar_model.predict(start=start_point, end=end_point, dynamic=False)
y_pred_ar_model
2022-01-01   -65160.762579
2022-04-01    32011.282594
2022-07-01    -7178.561389
2022-10-01    37457.043271
2023-01-01   -67971.549635
2023-04-01    40830.037800
2023-07-01    -6082.944499
2023-10-01    42060.452083
Freq: QS-OCT, dtype: float64
Finally, the differencing transformation was reversed twice in order to bring the predictions back to the scale of the data.
To do so, the cumulative sum of the predictions was calculated and added to the corresponding last value of the training set in the first-order differenced series, reversing the transformation from second-order to first-order differences.
Then, the cumulative sum of that series was calculated and added to the corresponding last value of the training set in the original series, reversing the first-order differencing back to the original scale.
# Reverting to first-order differenced net sales
y_pred_ar_model = y_diff.loc['2021-10-01'].values + y_pred_ar_model.cumsum()
y_pred_ar_model
2022-01-01   -26674.762579
2022-04-01     5336.520015
2022-07-01    -1842.041374
2022-10-01    35615.001897
2023-01-01   -32356.547738
2023-04-01     8473.490062
2023-07-01     2390.545563
2023-10-01    44450.997646
Freq: QS-OCT, dtype: float64
# Reverting to original scale from net sales
y_pred_ar_model = y.loc['2021-10-01'] + y_pred_ar_model.cumsum()
y_pred_ar_model
2022-01-01    186388.237421
2022-04-01    191724.757436
2022-07-01    189882.716061
2022-10-01    225497.717958
2023-01-01    193141.170220
2023-04-01    201614.660282
2023-07-01    204005.205846
2023-10-01    248456.203492
Freq: QS-OCT, dtype: float64
The predictions were plotted against the historical net sales data to visually assess the performance of the AR model.
plot_predictions(y_pred_ar_model, 'AR Model')
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
ar_rmse, ar_mae, ar_r2 = calculate_scores(pd.DataFrame(y_pred_ar_model))
net_sales
RMSE: 7687.806
MAE: 6558.916
Coefficient of Determination: 0.863
# Adding scores to lists
rmse_list.append(ar_rmse[0])
mae_list.append(ar_mae[0])
r2_list.append(ar_r2[0])
A set of Autoregressive (AR) models based on the Additive Decomposition of the WALMEX net sales time series was built.
The Additive Decomposition was deemed an appropriate technique, as the EDA suggested that the WALMEX net sales behave in a linear fashion, with constant changes over time and a regular seasonality of equal frequency and amplitude (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
Then, an Autoregressive model was built for each component of the time series obtained via the decomposition: trend, seasonality, and remainder. The net sales were forecasted by adding up the predictions from each AR model (Jamieson, 2019).
Likewise, the library statsmodels in Python was used to perform the Additive Decomposition and build the AR Models.
It is assumed that the components in the net sales time series can be added together as their changes over time are regular, the time trend is a straight line, and the seasonality has the same frequency and amplitude (Brownlee, 2020).
Moreover, it is assumed that each component of the time series can be predicted by means of a linear combination of their historical or lag values (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff + 1]
y_test = time_series[cutoff:]
As shown in the EDA, the WALMEX net sales adjust very well to an additive decomposition. So the training set was decomposed using the additive model with the seasonal_decompose class from the statsmodels library.
chart_title = 'Additive Decomposition for WALMEX Net Sales Training Set'
decomposition = seasonal_decompose(y_train, model='additive')
decomposition.plot()
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
# Creating variables for each component
trend = decomposition.trend.dropna()
seasonality = decomposition.seasonal.dropna()
remainder = decomposition.resid.dropna()
However, AR models require the time series to be stationary, so the trend was differenced several times until the ADF test indicated that no unit root remained in the data.
trend_diff = trend.diff().dropna()
trend_diff2 = trend_diff.diff().dropna()
trend_diff3 = trend_diff2.diff().dropna()
stationarity_test = adfuller(trend_diff3, autolag='AIC')
print("ADF statistic: ", stationarity_test[0])
print("P-value: ", stationarity_test[1])
ADF statistic:  -5.065106267573506
P-value:  1.64698660737564e-05
As the $p\text{-}value > 0.05$ then the null hypothesis is rejected. So, the third-order differenced time series is stationary.
Then, the order for each AR Model was identified by using the partial autocorrelation function (PACF) plot to assess the effect of past data (the so-called lags) on future data (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
# Trend Component
chart_title = 'Partial Autocorrelation for Trend Component'
trend_pacf = plot_pacf(trend_diff3, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
So, the order of the AR model for the trend component is 1.
# Seasonality Component
chart_title = 'Partial Autocorrelation for Seasonality Component'
seasonality_pacf = plot_pacf(seasonality, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
So, the order of the AR model for the seasonality component is 3.
# Remainder Component
chart_title = 'Partial Autocorrelation for Remainder Component'
remainder_pacf = plot_pacf(remainder, lags=10)
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
So, the order of the AR model for the remainder component is 1.
Then, an AR model was fitted to each component from the additive decomposition (trend, seasonality, and remainder) using the AutoReg class from statsmodels.
trend_ar_model = AutoReg(trend_diff3, lags=1).fit()
seasonality_ar_model = AutoReg(seasonality, lags=3).fit()
remainder_ar_model = AutoReg(remainder, lags=1).fit()
# Summary for the AR Model for Trend component
print(trend_ar_model.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: trend No. Observations: 26
Model: AutoReg(1) Log Likelihood -210.538
Method: Conditional MLE S.D. of innovations 1099.471
Date: Thu, 12 Sep 2024 AIC 427.076
Time: 09:48:56 BIC 430.733
Sample: 07-01-2015 HQIC 428.090
- 07-01-2021
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 8.6672 220.132 0.039 0.969 -422.783 440.117
trend.L1 -0.1201 0.200 -0.600 0.549 -0.512 0.272
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -8.3290 +0.0000j 8.3290 0.5000
-----------------------------------------------------------------------------
# Summary for the AR Model for Seasonality component
print(seasonality_ar_model.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: seasonal No. Observations: 33
Model: AutoReg(3) Log Likelihood 749.516
Method: Conditional MLE S.D. of innovations 0.000
Date: Thu, 12 Sep 2024 AIC -1489.032
Time: 09:48:56 BIC -1482.026
Sample: 10-01-2014 HQIC -1486.790
- 01-01-2022
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
const 4.832e-12 6.25e-13 7.732 0.000 3.61e-12 6.06e-12
seasonal.L1 -1.0000 6.66e-17 -1.5e+16 0.000 -1.000 -1.000
seasonal.L2 -1.0000 6.69e-17 -1.49e+16 0.000 -1.000 -1.000
seasonal.L3 -1.0000 6.9e-17 -1.45e+16 0.000 -1.000 -1.000
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -1.0000 -0.0000j 1.0000 -0.5000
AR.2 0.0000 -1.0000j 1.0000 -0.2500
AR.3 0.0000 +1.0000j 1.0000 0.2500
-----------------------------------------------------------------------------
# Summary for the AR Model for Remainder component
print(remainder_ar_model.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: resid No. Observations: 29
Model: AutoReg(1) Log Likelihood -255.912
Method: Conditional MLE S.D. of innovations 2254.688
Date: Thu, 12 Sep 2024 AIC 517.823
Time: 09:48:57 BIC 521.820
Sample: 10-01-2014 HQIC 519.045
- 07-01-2021
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -229.7760 426.350 -0.539 0.590 -1065.407 605.855
resid.L1 -0.1896 0.195 -0.975 0.330 -0.571 0.192
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -5.2739 +0.0000j 5.2739 0.5000
-----------------------------------------------------------------------------
Predictions from each AR model were obtained for the length of the testing set.
However, as the trend series was differenced before modeling, its predictions were brought back to the original scale of the data by adding the last value of the training set and calculating the cumulative sum of the predictions. This process was performed three times, as the trend series had been differenced to the third order.
Finally, the net sales forecast was built by recomposing (adding up) each of the predictions from the AR models into a single time series.
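The inverse-differencing step described above can be verified on a toy series (a minimal sketch, not the notebook's data): adding the last known level to the cumulative sum of the differences recovers the original scale.

```python
import pandas as pd

# Toy series to illustrate inverting a first-order difference
s = pd.Series([10.0, 12.0, 15.0, 19.0])
d = s.diff().dropna()  # first-order differences: [2, 3, 4]

# Adding the known starting level to the cumulative sum of the
# differences recovers the original values after that level:
recovered = s.iloc[0] + d.cumsum()
print(recovered.tolist())  # [12.0, 15.0, 19.0]
```

For a third-order differenced series, as here, the same operation is simply applied three times, once per differencing pass.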
# Defining starting and ending points
start_point = len(y) - len(y_test) - 2
end_point = start_point + len(y_test) - 1
# Getting predictions for each component
trend_predictions = trend_ar_model.predict(start=start_point - 3, end=end_point - 3, dynamic=False)
seasonality_predictions = seasonality_ar_model.predict(start=start_point + 2, end=end_point + 2, dynamic=False)
remainder_predictions= remainder_ar_model.predict(start=start_point, end=end_point, dynamic=False)
# De-transforming trend predictions to original scale
trend_predictions = trend_diff2.loc['2021-07-01'] + trend_predictions.cumsum()
trend_predictions = trend_diff.loc['2021-07-01'] + trend_predictions.cumsum()
trend_predictions = trend.loc['2021-07-01'] + trend_predictions.cumsum()
# Recomposing component predictions into a single time series
y_pred_ar_models = trend_predictions + seasonality_predictions + remainder_predictions
y_pred_ar_models
2022-01-01    183595.994447
2022-04-01    188927.735192
2022-07-01    196410.599671
2022-10-01    230031.384667
2023-01-01    213189.781651
2023-04-01    221794.739136
2023-07-01    232762.447850
2023-10-01    269864.772322
Freq: QS-OCT, dtype: float64
Likewise, the predictions were plotted against the historical net sales data to visually assess the performance of the AR models with additive decomposition.
plot_predictions(y_pred_ar_models, 'AR Models With Additive Decomposition')
In view of the plot above, the additive decomposition with AR models was able to provide close predictions for the WALMEX net sales.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
ar_decomp_rmse, ar_decomp_mae, ar_decomp_r2 = calculate_scores(pd.DataFrame(y_pred_ar_models))
net_sales RMSE: 11272.204 MAE: 8970.403 Coefficient of Determination: 0.706
# Adding scores to lists
rmse_list.append(ar_decomp_rmse[0])
mae_list.append(ar_decomp_mae[0])
r2_list.append(ar_decomp_r2[0])
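For reference, the three scores reported by the calculate_scores helper can be reproduced with NumPy alone (a sketch with hypothetical values standing in for y_test and the model predictions):

```python
import numpy as np

# Hypothetical actual and predicted values, stand-ins for y_test and
# the recomposed AR-model predictions:
actual = np.array([100.0, 110.0, 120.0])
predicted = np.array([102.0, 108.0, 123.0])

# Root mean squared error, mean absolute error, and coefficient of determination
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
mae = np.mean(np.abs(actual - predicted))
r2 = 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - actual.mean()) ** 2)
```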
An Autoregressive Moving Average (ARMA) model based on the WALMEX net sales time series was built.
The ARMA model is a combination of the autoregressive process and the moving average process, and it is denoted as ARMA($p$,$q$), where $p$ is the order of the autoregressive process, and $q$ is the order of the moving average process. The order of $p$ determines the number of past values that affect the present value; whereas the order of $q$ determines the number of past error terms that affect the present value (Peixeiro, 2022).
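As a reference (standard notation, not taken verbatim from the source), the ARMA($p$,$q$) process expresses the present value as:

$$y_{t} = C + \sum_{i=1}^{p} \phi_{i} y_{t-i} + \sum_{j=1}^{q} \theta_{j} \epsilon_{t-j} + \epsilon_{t}$$

where $C$ is a constant, $\phi_{i}$ are the autoregressive coefficients, $\theta_{j}$ are the moving average coefficients, and $\epsilon_{t}$ is a white noise error term.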
Indeed, the ACF and PACF plots shown earlier suggested that the time series was neither a pure moving average nor a pure autoregressive process, as sinusoidal patterns were found and no clear cut-off between significant and non-significant lags was identified. So, the ARMA model could yield better prediction results (Peixeiro, 2022).
Several orders $p$ and $q$, for the autoregressive and moving average portions of the model, respectively, were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).
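For reference, the AIC used for model selection trades goodness of fit against model complexity; in its standard form:

$$\text{AIC} = 2k - 2\ln(\hat{L})$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model, so lower AIC values indicate a better balance between fit and parsimony.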
Likewise, the function ARIMA from the library statsmodels in Python was used to build the ARMA model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It is assumed that the present values in the time series are linearly dependent on its past values, as well as on the mean, the current error term, and the past error terms (Peixeiro, 2022).
Another key assumption is that the time series becomes stationary after second-order differencing.
Moreover, it was assumed that the residuals from the model were normally distributed, independent, and uncorrelated; approximating white noise (Peixeiro, 2022).
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y_diff2
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
In this case, as several values for $p$ and $q$ were tested, a function to fit several ARMA models and calculate their corresponding AIC score was defined.
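The grid of candidate orders explored by this kind of search is the Cartesian product of the two ranges; a toy illustration (smaller than the ranges used below):

```python
from itertools import product

# For p in 1..2 and q in 1..3, the search grid holds 2 * 3 = 6 candidates:
orders = list(product(range(1, 3), range(1, 4)))
print(orders)  # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
```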
def ARMA_model(p, q, time_series):
    """
    Fits and optimizes an autoregressive moving average (ARMA) model given a set of p and q values, minimizing
    the Akaike Information Criterion (AIC).

    Parameters:
        p (range): Range for order p in the autoregressive portion of the ARMA model.
        q (range): Range for order q in the moving average portion of the ARMA model.
        time_series (pandas.Series): Time series data for fitting the ARMA model.

    Returns:
        ARMA_model (statsmodels.arima): An ARMA model object fitted according to the combination of p and q that
        minimizes the Akaike Information Criterion (AIC).
    """
    # Obtaining the combinations of p and q
    order_list = list(product(p, q))

    # Creating empty lists to store results
    order_results = []
    aic_results = []

    # Fitting models
    for order in order_list:
        ARMA_model = ARIMA(time_series, order=(order[0], 0, order[1])).fit()
        order_results.append(order)
        aic_results.append(ARMA_model.aic)

    # Converting lists to a dataframe
    results = pd.DataFrame({'(p,q)': order_results,
                            'AIC': aic_results})

    # Storing results from the best model
    lowest_aic = results.AIC.min()
    best_model = results.loc[results['AIC'] == lowest_aic, ['(p,q)']].values[0][0]

    # Printing results
    print(f'The best model is (p = {best_model[0]}, q = {best_model[1]}), with an AIC of {lowest_aic:.02f}.\n')
    print(results)

    # Fitting best model again
    ARMA_model = ARIMA(time_series, order=(best_model[0], 0, best_model[1])).fit()

    return ARMA_model
Then, several ARMA models were fitted, with the $p$ and $q$ ranges covering the significant lags previously identified in the ACF and PACF plots, plus one:
arma_model = ARMA_model(p=range(1,6), q=range(1,7), time_series=y_train)
The best model is (p = 4, q = 5), with an AIC of 634.21.
(p,q) AIC
0 (1, 1) 689.059291
1 (1, 2) 689.249985
2 (1, 3) 690.762092
3 (1, 4) 690.935897
4 (1, 5) 687.504766
5 (1, 6) 665.427105
6 (2, 1) 710.716053
7 (2, 2) 690.696598
8 (2, 3) 691.233741
9 (2, 4) 692.495432
10 (2, 5) 689.913298
11 (2, 6) 672.768398
12 (3, 1) 638.414862
13 (3, 2) 636.733963
14 (3, 3) 635.917669
15 (3, 4) 634.837998
16 (3, 5) 638.067682
17 (3, 6) 676.598570
18 (4, 1) 635.632338
19 (4, 2) 636.837116
20 (4, 3) 639.042331
21 (4, 4) 637.188384
22 (4, 5) 634.211084
23 (4, 6) 1854.993302
24 (5, 1) 637.444485
25 (5, 2) 637.908936
26 (5, 3) 638.939437
27 (5, 4) 637.940815
28 (5, 5) 634.471567
29 (5, 6) 1754.478023
So, the model ARMA(4, 5) was found to be the best relative to all the other models fitted.
arma_model.summary()
| Dep. Variable: | Net Sales | No. Observations: | 30 |
|---|---|---|---|
| Model: | ARIMA(4, 0, 5) | Log Likelihood | -306.106 |
| Date: | Thu, 12 Sep 2024 | AIC | 634.211 |
| Time: | 09:49:15 | BIC | 649.624 |
| Sample: | 07-01-2014 | HQIC | 639.142 |
| - 10-01-2021 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 1219.6968 | 2549.815 | 0.478 | 0.632 | -3777.849 | 6217.243 |
| ar.L1 | -0.0401 | 0.270 | -0.149 | 0.882 | -0.569 | 0.489 |
| ar.L2 | -0.0458 | 0.236 | -0.194 | 0.846 | -0.508 | 0.417 |
| ar.L3 | -0.0438 | 0.218 | -0.201 | 0.841 | -0.471 | 0.384 |
| ar.L4 | 0.9518 | 0.226 | 4.211 | 0.000 | 0.509 | 1.395 |
| ma.L1 | -0.9058 | 0.794 | -1.141 | 0.254 | -2.461 | 0.650 |
| ma.L2 | 0.1759 | 0.737 | 0.239 | 0.811 | -1.269 | 1.621 |
| ma.L3 | -0.1859 | 1.115 | -0.167 | 0.868 | -2.371 | 1.999 |
| ma.L4 | -0.5263 | 0.626 | -0.840 | 0.401 | -1.754 | 0.701 |
| ma.L5 | 0.6863 | 0.689 | 0.996 | 0.319 | -0.665 | 2.037 |
| sigma2 | 3.726e+07 | 0.023 | 1.59e+09 | 0.000 | 3.73e+07 | 3.73e+07 |
| Ljung-Box (L1) (Q): | 0.51 | Jarque-Bera (JB): | 1.00 |
|---|---|---|---|
| Prob(Q): | 0.47 | Prob(JB): | 0.61 |
| Heteroskedasticity (H): | 1.82 | Skew: | 0.43 |
| Prob(H) (two-sided): | 0.36 | Kurtosis: | 2.73 |
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness; in other words, whether the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022): a qualitative inspection of the diagnostic plots, and a quantitative test for residual autocorrelation using the Ljung-Box test.
# Storing residuals
residuals = arma_model.resid
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from ARMA model'
arma_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the diagnostic plots, the residuals are not completely normally distributed. However, the results were considered good enough for the purposes of the present study, as other, more sophisticated models were fitted below.
Then, the residuals were tested for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
| lb_stat | lb_pvalue | |
|---|---|---|
| 1 | 3.271269 | 0.070503 |
| 2 | 3.401430 | 0.182553 |
| 3 | 3.420409 | 0.331233 |
| 4 | 3.519843 | 0.474868 |
| 5 | 3.689612 | 0.594911 |
| 6 | 3.763542 | 0.708639 |
| 7 | 4.246189 | 0.751025 |
| 8 | 7.457154 | 0.488205 |
| 9 | 7.726973 | 0.561878 |
| 10 | 8.526754 | 0.577525 |
For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, and no autocorrelation was found in the residuals from the ARMA model. This means that the residuals are independently distributed and the model can be used for forecasting.
# Getting predictions from ARMA model at a confidence level of 95%.
predictions = arma_model.get_forecast(len(y_test) + 1)
predictions_ci = predictions.conf_int(alpha = 0.05)
predictions_ci
| lower Net Sales | upper Net Sales | |
|---|---|---|
| 2022-01-01 | -73772.743621 | -49648.092767 |
| 2022-04-01 | 10567.440730 | 43664.817097 |
| 2022-07-01 | -18367.879147 | 14921.241361 |
| 2022-10-01 | 15502.807909 | 49082.785082 |
| 2023-01-01 | -72873.369379 | -37432.758688 |
| 2023-04-01 | 8887.030408 | 44779.177577 |
| 2023-07-01 | -19407.430067 | 16633.904872 |
| 2023-10-01 | 14028.399318 | 50358.751266 |
# Getting predicted mean from confidence intervals from ARMA model
y_pred_arma_model = arma_model.predict(start = predictions_ci.index[0], end = predictions_ci.index[-1])
y_pred_arma_model
2022-01-01   -61710.418194
2022-04-01    27116.128914
2022-07-01    -1723.318893
2022-10-01    32292.796496
2023-01-01   -55153.064034
2023-04-01    26833.103992
2023-07-01    -1386.762597
2023-10-01    32193.575292
Freq: QS-OCT, Name: predicted_mean, dtype: float64
# Calculating predictions at the original scale of WALMEX net sales
y_pred_arma_model = y_diff.loc[y_train.index[-1]].values + y_pred_arma_model.cumsum()
y_pred_arma_model = y.loc[y_train.index[-1]] + y_pred_arma_model.cumsum()
y_pred_arma_model
2022-01-01    189838.581806
2022-04-01    193730.292526
2022-07-01    195898.684352
2022-10-01    230359.872675
2023-01-01    209667.996963
2023-04-01    215809.225244
2023-07-01    220563.690928
2023-10-01    257511.731903
Freq: QS-OCT, Name: predicted_mean, dtype: float64
The predictions were plotted against the historical net sales data to visually assess the performance of the ARMA model.
plot_predictions(y_pred_arma_model, 'ARMA Model')
In view of the plot above, the ARMA model was able to provide very close predictions for the WALMEX net sales.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
arma_rmse, arma_mae, arma_r2 = calculate_scores(pd.DataFrame(y_pred_arma_model))
net_sales RMSE: 5006.673 MAE: 4190.297 Coefficient of Determination: 0.942
# Adding scores to lists
rmse_list.append(arma_rmse[0])
mae_list.append(arma_mae[0])
r2_list.append(arma_r2[0])
An Autoregressive Integrated Moving Average (ARIMA) model based on the WALMEX net sales time series was built.
The ARIMA model is a combination of the autoregressive process, integration, and the moving average process for forecasting non-stationary time series, that is, time series with a trend and/or a non-constant variance (Peixeiro, 2022). It is denoted as ARIMA($p$,$d$,$q$), where $p$ is the order of the autoregressive process, $d$ is the integration order, and $q$ is the order of the moving average process. The order $p$ determines the number of past values that affect the present value, the order $q$ determines the number of past error terms that affect the present value, and the integration order $d$ indicates the number of times the time series has been differenced to become stationary (Peixeiro, 2022).
Similar to the building of the ARMA model, several orders $p$ and $q$, for the autoregressive and moving average portions of the model, respectively, were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).
The function ARIMA from the library statsmodels in Python was used to build the ARIMA model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It is assumed that the present values in the time series are linearly dependent on its past values, as well as on the mean, the current error term, and the past error terms (Peixeiro, 2022).
It was also assumed that the residuals from the model were normally distributed, independent, and uncorrelated; approximating white noise (Peixeiro, 2022).
On the other hand, from the EDA, it was determined that the order of integration for the WALMEX net sales is $d = 2$.
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
As several values for $p$ and $q$ were tested, a function to fit several ARIMA models and calculate their corresponding AIC scores was defined. Unlike the ARMA function, it takes an additional parameter for the integration order $d$.
def ARIMA_model(p, d, q, time_series):
    """
    Fits and optimizes an autoregressive integrated moving average (ARIMA) model based on the Akaike
    Information Criterion (AIC), given a set of p and q values, while keeping the d order constant.

    Parameters:
        p (range): Range for order p in the autoregressive portion of the ARIMA model.
        d (int): Integration order.
        q (range): Range for order q in the moving average portion of the ARIMA model.
        time_series (pandas.Series): Time series data for fitting the ARIMA model.

    Returns:
        ARIMA_model (statsmodels.arima): An ARIMA model object fitted according to the combination of p and q that
        minimizes the Akaike Information Criterion (AIC).
    """
    # Obtaining the combinations of p and q
    order_list = list(product(p, q))

    # Creating empty lists to store results
    order_results = []
    aic_results = []

    # Fitting models
    for order in order_list:
        ARIMA_model = ARIMA(time_series, order=(order[0], d, order[1])).fit()
        order_results.append((order[0], d, order[1]))
        aic_results.append(ARIMA_model.aic)

    # Converting lists to a dataframe
    results = pd.DataFrame({'(p,d,q)': order_results,
                            'AIC': aic_results})

    # Storing results from the best model
    lowest_aic = results.AIC.min()
    best_model = results.loc[results['AIC'] == lowest_aic, ['(p,d,q)']].values[0][0]

    # Printing results
    print(f'The best model is (p = {best_model[0]}, d = {d}, q = {best_model[2]}), with an AIC of {lowest_aic:.02f}.\n')
    print(results)

    # Fitting best model again
    ARIMA_model = ARIMA(time_series, order=(best_model[0], best_model[1], best_model[2])).fit()

    return ARIMA_model
Then, several ARIMA models were fitted, with the $p$ and $q$ ranges covering the significant lags previously identified in the ACF and PACF plots, plus one:
arima_model = ARIMA_model(p=range(1,6), d=2, q=range(1,7), time_series=y_train)
The best model is (p = 3, d = 2, q = 1), with an AIC of 688.12.
(p,d,q) AIC
0 (1, 2, 1) 690.735239
1 (1, 2, 2) 693.351496
2 (1, 2, 3) 695.276194
3 (1, 2, 4) 694.769638
4 (1, 2, 5) 692.507787
5 (1, 2, 6) 696.912015
6 (2, 2, 1) 692.110715
7 (2, 2, 2) 694.979680
8 (2, 2, 3) 696.205633
9 (2, 2, 4) 696.308298
10 (2, 2, 5) 694.424715
11 (2, 2, 6) 714.181864
12 (3, 2, 1) 688.122065
13 (3, 2, 2) 695.939508
14 (3, 2, 3) 698.538911
15 (3, 2, 4) 1068.206667
16 (3, 2, 5) 931.741421
17 (3, 2, 6) 1417.464562
18 (4, 2, 1) 703.977340
19 (4, 2, 2) 787.877988
20 (4, 2, 3) 1001.862630
21 (4, 2, 4) 1056.239218
22 (4, 2, 5) 910.280141
23 (4, 2, 6) 6785.936282
24 (5, 2, 1) 690.665518
25 (5, 2, 2) 826.874334
26 (5, 2, 3) 1034.760249
27 (5, 2, 4) 1109.219729
28 (5, 2, 5) 923.357552
29 (5, 2, 6) 8630.752800
So, the model ARIMA(3, 2, 1) was found to be the best relative to all the other models fitted.
arima_model.summary()
| Dep. Variable: | net_sales | No. Observations: | 32 |
|---|---|---|---|
| Model: | ARIMA(3, 2, 1) | Log Likelihood | -339.061 |
| Date: | Thu, 12 Sep 2024 | AIC | 688.122 |
| Time: | 09:49:41 | BIC | 695.128 |
| Sample: | 01-01-2014 | HQIC | 690.363 |
| - 10-01-2021 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | -0.9986 | 0.513 | -1.948 | 0.051 | -2.004 | 0.006 |
| ar.L2 | -0.8087 | 0.365 | -2.217 | 0.027 | -1.524 | -0.094 |
| ar.L3 | -0.5638 | 0.129 | -4.376 | 0.000 | -0.816 | -0.311 |
| ma.L1 | 0.0512 | 0.899 | 0.057 | 0.955 | -1.711 | 1.813 |
| sigma2 | 3.413e+08 | 5.31e-09 | 6.42e+16 | 0.000 | 3.41e+08 | 3.41e+08 |
| Ljung-Box (L1) (Q): | 0.60 | Jarque-Bera (JB): | 38.39 |
|---|---|---|---|
| Prob(Q): | 0.44 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 0.28 | Skew: | -1.54 |
| Prob(H) (two-sided): | 0.06 | Kurtosis: | 7.61 |
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness; in other words, whether the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022): a qualitative inspection of the diagnostic plots, and a quantitative test for residual autocorrelation using the Ljung-Box test.
# Storing residuals
residuals = arima_model.resid
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from ARIMA model'
arima_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the diagnostic plots, the residuals are not completely normally distributed. However, the results were considered good enough for the purposes of the present study.
Then, the residuals were tested for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
| lb_stat | lb_pvalue | |
|---|---|---|
| 1 | 0.996298 | 0.318208 |
| 2 | 1.272624 | 0.529241 |
| 3 | 1.287834 | 0.732024 |
| 4 | 2.389698 | 0.664490 |
| 5 | 2.507422 | 0.775377 |
| 6 | 2.584535 | 0.858889 |
| 7 | 2.584620 | 0.920591 |
| 8 | 3.390625 | 0.907511 |
| 9 | 3.550224 | 0.938451 |
| 10 | 3.915916 | 0.951060 |
For each of the lags from 1 to 10, the $p\text{-}values$ were well above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals from the ARIMA model. Therefore, the residuals are independently distributed and the model can be used for forecasting.
# Getting predictions from ARIMA model at a confidence level of 95%.
predictions = arima_model.get_forecast(len(y_test) + 1)
predictions_ci = predictions.conf_int(alpha = 0.05)
predictions_ci
| lower net_sales | upper net_sales | |
|---|---|---|
| 2022-01-01 | 165173.606625 | 237593.181281 |
| 2022-04-01 | 158677.779068 | 263820.459112 |
| 2022-07-01 | 149950.171598 | 288340.797115 |
| 2022-10-01 | 151955.606075 | 327783.341282 |
| 2023-01-01 | 120735.684089 | 353721.126484 |
| 2023-04-01 | 106785.634575 | 390527.087914 |
| 2023-07-01 | 88968.278221 | 426427.498843 |
| 2023-10-01 | 73308.793181 | 468528.597704 |
# Getting predicted mean from confidence intervals from ARIMA model
y_pred_arima_model = arima_model.predict(start=predictions_ci.index[0], end=predictions_ci.index[-1])
y_pred_arima_model
2022-01-01    201383.393953
2022-04-01    211249.119090
2022-07-01    219145.484356
2022-10-01    239869.473678
2023-01-01    237228.405287
2023-04-01    248656.361244
2023-07-01    257697.888532
2023-10-01    270918.695443
Freq: QS-OCT, Name: predicted_mean, dtype: float64
In this case, unlike the previous models, it was not necessary to bring the predictions back to the original scale of the series, as the ARIMA model handles the differencing (and its inversion) internally through the $d$ order.
The predictions were plotted against the historical net sales data to visually assess the performance of the ARIMA model.
plot_predictions(y_pred_arima_model, 'ARIMA Model')
In view of the plot above, it seems that the ARIMA model was not able to capture the seasonality of the time series. Notably, the ARMA model yielded better predictions.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
arima_rmse, arima_mae, arima_r2 = calculate_scores(pd.DataFrame(y_pred_arima_model))
net_sales RMSE: 27274.028 MAE: 24120.853 Coefficient of Determination: -0.720
# Adding scores to lists
rmse_list.append(arima_rmse[0])
mae_list.append(arima_mae[0])
r2_list.append(arima_r2[0])
A Seasonal Autoregressive Integrated Moving Average (SARIMA) model based on the WALMEX net sales time series was built.
The SARIMA model is a combination of the autoregressive process, integration, and the moving average process for forecasting non-stationary time series, while also accounting for seasonal patterns (Peixeiro, 2022). It is denoted as $\text{SARIMA}(p,d,q)(P,D,Q)_{m}$, where $p$ is the order of the autoregressive process, $d$ is the integration order, $q$ is the order of the moving average process, $m$ is the frequency, and $P$, $D$, and $Q$ are the seasonal orders for the autoregressive, integration, and moving average processes, respectively (Peixeiro, 2022).
Similar to the building of the ARMA and ARIMA models, several orders $p$, $q$, $P$, and $Q$ were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).
The function SARIMAX from the library statsmodels in Python was used to build the SARIMA model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It is assumed that the present values in the time series are linearly dependent on its past values, as well as on the mean, the current error term, and the past error terms (Peixeiro, 2022).
It was also assumed that the residuals from the model were normally distributed, independent, and uncorrelated; approximating white noise (Peixeiro, 2022).
Furthermore, as the time series is reported on a quarterly basis, the frequency, or number of observations per seasonal cycle, was defined as $m = 4$. This assumption was also supported by the additive decomposition of the time series carried out above.
Finally, from the EDA, it was found that the original time series was not stationary. In this sense, it was proposed to use the first-differenced time series as a basis for seasonal differencing. So, the order of integration for the WALMEX net sales was $d = 1$ for the SARIMA model.
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
The first-differenced time series was seasonally differenced using $m = 4$, and its stationarity was tested with the ADF test.
y_season_diff = y_diff.diff(periods=4).dropna()
stationarity_test = adfuller(y_season_diff, autolag='AIC')
print("ADF statistic: ", stationarity_test[0])
print("P-value: ", stationarity_test[1])
ADF statistic: -3.0522399619417264 P-value: 0.030295538176182014
The ADF statistic is strongly negative and the $p\text{-}value < 0.05$, so the null hypothesis that the time series is not stationary is rejected.
In this context, the order of seasonal integration is $D = 1$.
As several values for $p$, $q$, $P$, and $Q$ were tested, a function to fit several SARIMA models and calculate their corresponding AIC score was defined.
def SARIMA_model(p, d, q, P, D, Q, m, time_series):
    """
    Fits and optimizes a seasonal autoregressive integrated moving average (SARIMA) model based on the Akaike
    Information Criterion (AIC), given a set of p, q, P, and Q values, while keeping the d and D orders constant.
    The frequency m is also kept constant.

    Parameters:
        p (range): Range for order p in the autoregressive portion of the SARIMA model.
        d (int): Integration order.
        q (range): Range for order q in the moving average portion of the SARIMA model.
        P (range): Range for order P in the seasonal autoregressive portion of the SARIMA model.
        D (int): Seasonal integration order.
        Q (range): Range for order Q in the seasonal moving average portion of the SARIMA model.
        m (int): Number of observations per seasonal cycle.
        time_series (pandas.Series): Time series data for fitting the SARIMA model.

    Returns:
        SARIMA_model (statsmodels.sarimax): A SARIMAX model object fitted according to the combination of p, q, P,
        and Q values that minimizes the Akaike Information Criterion (AIC).
    """
    # Obtaining the combinations of p, q, P, and Q
    order_list = list(product(p, q, P, Q))

    # Creating empty lists to store results
    order_results = []
    aic_results = []

    # Fitting models
    for order in order_list:
        SARIMA_model = SARIMAX(endog=time_series,
                               order=(order[0], d, order[1]),
                               seasonal_order=(order[2], D, order[3], m),
                               ).fit(disp=False)
        order_results.append((order[0], d, order[1], order[2], D, order[3], m))
        aic_results.append(SARIMA_model.aic)

    # Converting lists to a dataframe
    results = pd.DataFrame({'(p,d,q)(P,D,Q)m': order_results,
                            'AIC': aic_results})

    # Storing results from the best model
    lowest_aic = results.AIC.min()
    best_model = results.loc[results['AIC'] == lowest_aic, ['(p,d,q)(P,D,Q)m']].values[0][0]

    # Printing results
    print(f'The best model is (p = {best_model[0]}, d = {d}, q = {best_model[2]})(P = {best_model[3]}, D = {D}, Q = {best_model[5]})(m = {m}), with an AIC of {lowest_aic:.02f}.\n')
    print(results)

    # Fitting best model again
    SARIMA_model = SARIMAX(endog=time_series,
                           order=(best_model[0], d, best_model[2]),
                           seasonal_order=(best_model[3], D, best_model[5], m),
                           ).fit(disp=False)

    return SARIMA_model
Then, several SARIMA models were fitted with orders up to 4, as the time series is reported on a quarterly basis:
sarima_model = SARIMA_model(p=range(1,4),
d=1,
q=range(1,4),
P=range(1,5),
D=1,
Q=range(1,5),
m=4,
time_series=y_train)
The best model is (p = 1, d = 1, q = 1)(P = 1, D = 1, Q = 1)(m = 4), with an AIC of 555.61.
(p,d,q)(P,D,Q)m AIC
0 (1, 1, 1, 1, 1, 1, 4) 555.609844
1 (1, 1, 1, 1, 1, 2, 4) 557.206290
2 (1, 1, 1, 1, 1, 3, 4) 560.066934
3 (1, 1, 1, 1, 1, 4, 4) 561.218828
4 (1, 1, 1, 2, 1, 1, 4) 557.223439
.. ... ...
139 (3, 1, 3, 3, 1, 4, 4) 565.349865
140 (3, 1, 3, 4, 1, 1, 4) 567.793619
141 (3, 1, 3, 4, 1, 2, 4) 564.148215
142 (3, 1, 3, 4, 1, 3, 4) 565.327303
143 (3, 1, 3, 4, 1, 4, 4) 568.005094
[144 rows x 2 columns]
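The 144 candidate models correspond to the full Cartesian product of the four ranges passed above:

```python
from itertools import product

# p and q each take 3 values (1-3), P and Q each take 4 values (1-4),
# so the grid holds 3 * 3 * 4 * 4 = 144 candidate (p, q, P, Q) orders:
grid = list(product(range(1, 4), range(1, 4), range(1, 5), range(1, 5)))
print(len(grid))  # 144
```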
So, the model $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ was found to be the best relative to all the other models fitted.
sarima_model.summary()
| Dep. Variable: | net_sales | No. Observations: | 32 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1)x(1, 1, 1, 4) | Log Likelihood | -272.805 |
| Date: | Thu, 12 Sep 2024 | AIC | 555.610 |
| Time: | 09:51:32 | BIC | 562.089 |
| Sample: | 01-01-2014 | HQIC | 557.536 |
| - 10-01-2021 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | 0.9072 | 0.202 | 4.488 | 0.000 | 0.511 | 1.303 |
| ma.L1 | -0.9847 | 0.495 | -1.987 | 0.047 | -1.956 | -0.014 |
| ar.S.L4 | -0.8204 | 1.819 | -0.451 | 0.652 | -4.386 | 2.745 |
| ma.S.L4 | 0.8012 | 1.859 | 0.431 | 0.666 | -2.842 | 4.445 |
| sigma2 | 4.237e+07 | 1.69e-08 | 2.51e+15 | 0.000 | 4.24e+07 | 4.24e+07 |
| Ljung-Box (L1) (Q): | 0.01 | Jarque-Bera (JB): | 0.34 |
|---|---|---|---|
| Prob(Q): | 0.94 | Prob(JB): | 0.84 |
| Heteroskedasticity (H): | 3.65 | Skew: | -0.25 |
| Prob(H) (two-sided): | 0.07 | Kurtosis: | 3.23 |
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness; in other words, whether the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022): a qualitative inspection of the diagnostic plots, and a quantitative test for residual autocorrelation using the Ljung-Box test.
# Storing residuals
residuals = sarima_model.resid
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from SARIMA model'
sarima_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the diagnostics plots, the residuals are not completely normally distributed. However, it was considered that the results were good enough for the purposes of the present study.
Then, the residuals were assessed by testing for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
| lb_stat | lb_pvalue | |
|---|---|---|
| 1 | 0.296909 | 0.585827 |
| 2 | 0.297667 | 0.861713 |
| 3 | 0.518560 | 0.914795 |
| 4 | 8.331227 | 0.080171 |
| 5 | 8.331242 | 0.138900 |
| 6 | 8.345408 | 0.213874 |
| 7 | 8.346893 | 0.303001 |
| 8 | 8.347182 | 0.400312 |
| 9 | 8.355757 | 0.498730 |
| 10 | 8.571613 | 0.573183 |
For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the SARIMA model. Hence, the residuals are independently distributed and the model can be used for forecasting.
y_pred_sarima_model = sarima_model.get_prediction(start=len(y_train), end=(len(y_train)+len(y_test))).predicted_mean
y_pred_sarima_model
2022-01-01    188700.036884
2022-04-01    192431.179204
2022-07-01    192886.115990
2022-10-01    231412.118662
2023-01-01    206087.318747
2023-04-01    209966.989976
2023-07-01    210934.267963
2023-10-01    249338.432491
Freq: QS-OCT, Name: predicted_mean, dtype: float64
As in the ARIMA model, it was not necessary to bring the predictions back to the original scale of the series, as the SARIMA model handled that process by itself.
The predictions were plotted against the historical net sales data to visually assess the performance of the SARIMA model.
plot_predictions(y_pred_sarima_model, 'SARIMA Model')
In view of the plot above, the SARIMA model was able to neatly capture the trend and seasonality of the time series; this was the best performance obtained so far.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
sarima_rmse, sarima_mae, sarima_r2 = calculate_scores(pd.DataFrame(y_pred_sarima_model))
net_sales RMSE: 2675.576 MAE: 2372.531 Coefficient of Determination: 0.983
# Adding scores to lists
rmse_list.append(sarima_rmse[0])
mae_list.append(sarima_mae[0])
r2_list.append(sarima_r2[0])
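For reference, the three metrics reported by calculate_scores (defined earlier in the notebook) reduce to the following formulas. This is a minimal sketch with hypothetical names, not the notebook's actual helper:

```python
import numpy as np

def scores_sketch(y_true, y_pred):
    # Hypothetical minimal version of the metrics reported by calculate_scores
    errors = np.asarray(y_true) - np.asarray(y_pred)
    rmse = np.sqrt(np.mean(errors ** 2))          # root mean squared error
    mae = np.mean(np.abs(errors))                 # mean absolute error
    ss_res = np.sum(errors ** 2)                  # residual sum of squares
    ss_tot = np.sum((np.asarray(y_true) - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot                      # coefficient of determination
    return rmse, mae, r2
```

Note that, computed this way, the $r^{2}$ score can be negative whenever a model performs worse than simply predicting the mean of the test set, which explains the negative scores reported for some of the models in this notebook.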
A Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX) model was built based on the WALMEX net sales, units, GDP, stock value, S&P 500, IPC, exchange rate, and interest rate time series.
The SARIMAX model combines the autoregressive process, integration, and the moving average process for forecasting non-stationary time series with seasonal patterns, while also including the effects of external variables (Peixeiro, 2022). It is denoted as $\mathbf{\text{SARIMA}(p,d,q)(P,D,Q)_{m} + \sum_{i=1}^{n} \beta_i X_t^i}$; where $p$ is the order of the autoregressive process, $d$ is the integration order, $q$ is the order of the moving average process, and $m$ is the frequency; $P$, $D$, and $Q$ are the orders of the seasonal autoregressive, integration, and moving average processes, respectively; and $X_t^i$ are any number of exogenous variables with their corresponding coefficients $\beta_i$ (Peixeiro, 2022).
Similar to the building of the ARMA, ARIMA, and SARIMA models, several orders $p$, $q$, $P$, and $Q$ were tested and the best performing model was selected according to the Akaike Information Criterion (AIC) and a residual analysis (Peixeiro, 2022).
The function SARIMAX from the library statsmodels in Python was used to build the SARIMAX model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It is assumed that the present values of the time series are linearly dependent on both its past values and on the mean, the current error term, and the past error terms (Peixeiro, 2022).
It was also assumed that the residuals from the model were normally distributed, independent, and uncorrelated, approximating white noise (Peixeiro, 2022).
Furthermore, from the EDA, it was found that the original time series was not stationary. In this sense, it was proposed to use the first-differenced time series as a basis for seasonal differencing. So, the order of integration for the WALMEX net sales was $d = 1$ for the SARIMAX model.
As the time series is reported on a quarterly basis, the frequency or number of observations per seasonal cycle was defined as $m = 4$. This assumption was also supported by the additive decomposition of the time series carried out above.
Finally, from the seasonal differencing carried out above at the SARIMA model, the seasonal order of integration was defined as $D = 1$.
The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
# Endogenous variable
time_series = y
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
# Exogenous variables
exog = X.copy()
exog_train = exog.iloc[:cutoff]
exog_test = exog.iloc[cutoff:]
As several values for $p$, $q$, $P$, and $Q$ were tested, a function to fit several SARIMAX models and calculate their corresponding AIC score was defined.
def SARIMAX_model(p, d, q, P, D, Q, m, endog, exog):
    """
    Fits and optimizes a seasonal autoregressive integrated moving average with exogenous variables (SARIMAX)
    model based on the Akaike Information Criterion (AIC), given a set of p, q, P, and Q values, while keeping
    the d and D orders constant. The frequency m is also kept constant.
    Parameters:
    p (range): Range for order p in the autoregressive portion of the SARIMAX model.
    d (int): Integration order.
    q (range): Range for order q in the moving average portion of the SARIMAX model.
    P (range): Range for order P in the seasonal autoregressive portion of the SARIMAX model.
    D (int): Seasonal integration order.
    Q (range): Range for order Q in the seasonal moving average portion of the SARIMAX model.
    m (int): Number of observations per seasonal cycle.
    endog (pandas.Series): Time series of the endogenous variable for fitting the SARIMAX model.
    exog (pandas.DataFrame): Time series of the exogenous variables for fitting the SARIMAX model.
    Returns:
    SARIMAX_model (statsmodels.sarimax): A SARIMAX model object fitted according to the combination of p, q, P,
    and Q values that minimizes the Akaike Information Criterion (AIC).
    """
    # Obtaining the combinations of p, q, P, and Q
    order_list = list(product(p, q, P, Q))
    # Creating empty lists to store results
    order_results = []
    aic_results = []
    # Fitting models
    for order in order_list:
        SARIMAX_model = SARIMAX(endog=endog,
                                exog=exog,
                                order=(order[0], d, order[1]),
                                seasonal_order=(order[2], D, order[3], m),
                                ).fit(disp=False)
        order_results.append((order[0], d, order[1], order[2], D, order[3], m))
        aic_results.append(SARIMAX_model.aic)
    # Converting lists to a dataframe
    results = pd.DataFrame({'(p,d,q)(P,D,Q)m': order_results,
                            'AIC': aic_results})
    # Storing results from the best model
    lowest_aic = results.AIC.min()
    best_model = results.loc[results['AIC'] == lowest_aic, ['(p,d,q)(P,D,Q)m']].values[0][0]
    # Printing results
    print(f'The best model is (p = {best_model[0]}, d = {d}, q = {best_model[2]})(P = {best_model[3]}, D = {D}, Q = {best_model[5]})(m = {m}), with an AIC of {lowest_aic:.02f}.\n')
    print(results)
    # Fitting the best model again
    SARIMAX_model = SARIMAX(endog=endog,
                            exog=exog,
                            order=(best_model[0], d, best_model[2]),
                            seasonal_order=(best_model[3], D, best_model[5], m),
                            ).fit(disp=False)
    return SARIMAX_model
Then, several SARIMAX models were fitted, testing the orders $p$ and $q$ from 1 to 3 while keeping the seasonal orders fixed at $P = Q = 1$, with $m = 4$ as the time series is reported on a quarterly basis:
sarimax_model = SARIMAX_model(p=range(1,4),
                              d=1,
                              q=range(1,4),
                              P=range(1,2),
                              D=1,
                              Q=range(1,2),
                              m=4,
                              endog=y_train,
                              exog=exog_train)
The best model is (p = 1, d = 1, q = 1)(P = 1, D = 1, Q = 1)(m = 4), with an AIC of 531.53.
(p,d,q)(P,D,Q)m AIC
0 (1, 1, 1, 1, 1, 1, 4) 531.525500
1 (1, 1, 2, 1, 1, 1, 4) 533.679524
2 (1, 1, 3, 1, 1, 1, 4) 536.454624
3 (2, 1, 1, 1, 1, 1, 4) 533.436486
4 (2, 1, 2, 1, 1, 1, 4) 535.425705
5 (2, 1, 3, 1, 1, 1, 4) 655.107508
6 (3, 1, 1, 1, 1, 1, 4) 534.709970
7 (3, 1, 2, 1, 1, 1, 4) 537.236569
8 (3, 1, 3, 1, 1, 1, 4) 539.914609
So, the model $\text{SARIMA}(1,1,1)(1,1,1)_{4} + \sum_{i=1}^{n} \beta_i X_t^i$ was found to be the best relative to all other models fitted.
sarimax_model.summary()
| Dep. Variable: | net_sales | No. Observations: | 32 |
|---|---|---|---|
| Model: | SARIMAX(1, 1, 1)x(1, 1, 1, 4) | Log Likelihood | -253.763 |
| Date: | Thu, 12 Sep 2024 | AIC | 531.525 |
| Time: | 09:51:55 | BIC | 547.076 |
| Sample: | 01-01-2014 | HQIC | 536.149 |
| - 10-01-2021 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| units | 56.2518 | 10.479 | 5.368 | 0.000 | 35.714 | 76.790 |
| gdp | 0.0010 | 0.001 | 1.369 | 0.171 | -0.000 | 0.002 |
| walmex | 227.9497 | 210.601 | 1.082 | 0.279 | -184.821 | 640.720 |
| sp500 | -17.1562 | 10.605 | -1.618 | 0.106 | -37.942 | 3.629 |
| ipc | 0.5341 | 0.581 | 0.919 | 0.358 | -0.604 | 1.673 |
| exchange_rates | 666.4456 | 984.956 | 0.677 | 0.499 | -1264.032 | 2596.923 |
| interest_rates | 1893.6433 | 1521.903 | 1.244 | 0.213 | -1089.231 | 4876.518 |
| ar.L1 | -0.7285 | 1.210 | -0.602 | 0.547 | -3.101 | 1.644 |
| ma.L1 | 0.7682 | 1.285 | 0.598 | 0.550 | -1.750 | 3.287 |
| ar.S.L4 | -0.6031 | 4.516 | -0.134 | 0.894 | -9.455 | 8.249 |
| ma.S.L4 | 0.6068 | 4.500 | 0.135 | 0.893 | -8.213 | 9.426 |
| sigma2 | 8.966e+06 | 0.081 | 1.11e+08 | 0.000 | 8.97e+06 | 8.97e+06 |
| Ljung-Box (L1) (Q): | 0.02 | Jarque-Bera (JB): | 1.02 |
|---|---|---|---|
| Prob(Q): | 0.88 | Prob(JB): | 0.60 |
| Heteroskedasticity (H): | 2.19 | Skew: | -0.29 |
| Prob(H) (two-sided): | 0.26 | Kurtosis: | 2.25 |
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022):
# Storing residuals
residuals = sarimax_model.resid
# Plotting QQ-Plot
chart_title = 'Diagnostic plots for standardized residuals from SARIMAX model'
sarimax_model.plot_diagnostics(figsize=(10, 8))
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the diagnostics plots, the residuals are not completely normally distributed. However, it was considered that the results were good enough for the purposes of the present study.
Then, the residuals were assessed by testing for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
# Ljung-Box test of the residuals on 10 lags
acorr_ljungbox(residuals, np.arange(1, 11))
| lb_stat | lb_pvalue | |
|---|---|---|
| 1 | 0.213270 | 0.644216 |
| 2 | 0.220932 | 0.895417 |
| 3 | 0.825297 | 0.843407 |
| 4 | 9.256681 | 0.054994 |
| 5 | 9.257353 | 0.099229 |
| 6 | 9.259061 | 0.159524 |
| 7 | 9.259126 | 0.234583 |
| 8 | 9.272412 | 0.319839 |
| 9 | 9.272497 | 0.412510 |
| 10 | 9.292785 | 0.504561 |
For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the SARIMAX model. Hence, the residuals are independently distributed and the model can be used for forecasting.
Unlike the other models, to get predictions for the WALMEX net sales series it was necessary to provide the actual values of the exogenous variables over the testing time window:
y_pred_sarimax_model = sarimax_model.get_prediction(start=len(y_train), end=(len(y_train)+len(y_test)), exog=exog_test).predicted_mean
y_pred_sarimax_model
2022-01-01    193757.410394
2022-04-01    207152.889375
2022-07-01    212480.673967
2022-10-01    258508.919300
2023-01-01    234784.779367
2023-04-01    238285.358637
2023-07-01    235479.175776
2023-10-01    276060.741734
Freq: QS-OCT, Name: predicted_mean, dtype: float64
Thus, for forecasting the WALMEX net sales into the future, with yet-unknown values for the exogenous variables, it would be necessary to forecast those variables first in order to feed the SARIMAX model. Of course, this could cause prediction errors to accumulate, rendering the SARIMAX model unreliable (Peixeiro, 2022). So, it is recommended to use the SARIMAX model to forecast only the next period in the future (Peixeiro, 2022).
The predictions were plotted against the historical net sales data to visually assess the performance of the SARIMAX model.
plot_predictions(y_pred_sarimax_model, 'SARIMAX Model')
In view of the plot above, the SARIMAX model was able to capture the trend and seasonality of the time series. However, its performance was inferior to that from the SARIMA model.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
sarimax_rmse, sarimax_mae, sarimax_r2 = calculate_scores(pd.DataFrame(y_pred_sarimax_model))
net_sales RMSE: 21608.095 MAE: 20415.994 Coefficient of Determination: -0.080
# Adding scores to lists
rmse_list.append(sarimax_rmse[0])
mae_list.append(sarimax_mae[0])
r2_list.append(sarimax_r2[0])
A Simple Exponential Smoothing (SES) model based on the WALMEX net sales was built.
The SES model is a smoothing technique that uses an exponential window function (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). It is useful when a time series exhibits neither trend nor seasonality (Atwan, 2022).
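At its core, SES recursively updates a level $\ell_t = \alpha y_t + (1-\alpha)\ell_{t-1}$ and forecasts the last level flat into the future, which is why its predictions are constant over the whole horizon. A minimal sketch (hypothetical function name, illustrative values):

```python
def ses_forecast(series, alpha, initial_level):
    # l_t = alpha * y_t + (1 - alpha) * l_{t-1}; the forecast for any
    # future horizon is simply the last smoothed level
    level = initial_level
    for y_t in series:
        level = alpha * y_t + (1 - alpha) * level
    return level

ses_forecast([2.0, 4.0], alpha=0.5, initial_level=0.0)  # → 2.5
```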
The function SimpleExpSmoothing from the library statsmodels in Python was used to build the SES model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
The SES model requires the time series to be stationary; so, by using the second-order differenced series, it was assumed that the stationarity requirement was fulfilled.
Furthermore, it was assumed that the seasonal patterns were negligible for the purposes of this model.
The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y_diff2
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff + 1:]
# Model Fitting
ses_model = SimpleExpSmoothing(y_train).fit()
ses_model.summary()
| Dep. Variable: | Net Sales | No. Observations: | 30 |
|---|---|---|---|
| Model: | SimpleExpSmoothing | SSE | 29041808623.000 |
| Optimized: | True | AIC | 624.724 |
| Trend: | None | BIC | 627.527 |
| Seasonal: | None | AICC | 626.324 |
| Seasonal Periods: | None | Date: | Thu, 12 Sep 2024 |
| Box-Cox: | False | Time: | 09:52:07 |
| Box-Cox Coeff.: | None |
| coeff | code | optimized | |
|---|---|---|---|
| smoothing_level | 0.0050000 | alpha | True |
| initial_level | -828.00000 | l.0 | False |
y_pred_ses_model = ses_model.forecast(len(y_test) + 1)
y_pred_ses_model
2022-01-01   -529.793623
2022-04-01   -529.793623
2022-07-01   -529.793623
2022-10-01   -529.793623
2023-01-01   -529.793623
2023-04-01   -529.793623
2023-07-01   -529.793623
2023-10-01   -529.793623
Freq: QS-OCT, dtype: float64
# Reverting the second-order differencing to bring the predictions back to the original series scale
y_pred_ses_model = y_diff.loc[y_train.index[-1]].values + y_pred_ses_model.cumsum()
y_pred_ses_model = y.loc[y_train.index[-1]] + y_pred_ses_model.cumsum()
y_pred_ses_model
2022-01-01    251019.206377
2022-04-01    288445.619130
2022-07-01    325342.238261
2022-10-01    361709.063768
2023-01-01    397546.095652
2023-04-01    432853.333913
2023-07-01    467630.778551
2023-10-01    501878.429566
Freq: QS-OCT, dtype: float64
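The two cumulative sums above invert the second-order differencing, re-anchoring the series once per differencing order. The mechanics can be sketched on a toy series (illustrative values only, not the notebook's data):

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
d1 = np.diff(y)    # first differences: [2, 3, 4, 5]
d2 = np.diff(d1)   # second differences: [1, 1, 1]

# Reverting: cumulatively sum and re-anchor with a known value,
# once per differencing order
d1_rebuilt = d1[0] + np.concatenate(([0.0], np.cumsum(d2)))
y_rebuilt = y[0] + np.cumsum(d1_rebuilt)
# y_rebuilt reproduces y[1:] exactly: [12, 15, 19, 24]
```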
The predictions were plotted against the historical net sales data to visually assess the performance of the SES model.
plot_predictions(y_pred_ses_model, 'SES Model')
In view of the plot above, as expected, the SES model was not able to capture either the trend or the seasonality of the time series.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
ses_rmse, ses_mae, ses_r2 = calculate_scores(pd.DataFrame(y_pred_ses_model))
net_sales RMSE: 180107.784 MAE: 166655.346 Coefficient of Determination: -74.013
# Adding scores to lists
rmse_list.append(ses_rmse[0])
mae_list.append(ses_mae[0])
r2_list.append(ses_r2[0])
A Holt-Winters (HW) model based on the WALMEX net sales was built.
The HW model is a smoothing technique that uses an exponentially weighted moving average process (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023). It is an enhancement over simple exponential smoothing and can be used when a time series exhibits both a trend and seasonality (Atwan, 2022).
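For reference, in a common parameterization the additive Holt-Winters recursions maintain a level $\ell_t$, a trend $b_t$, and a seasonal component $s_t$, with smoothing parameters $\alpha$, $\beta$, and $\gamma$ corresponding to the smoothing_level, smoothing_trend, and smoothing_seasonal coefficients reported in the model summary below:
$$\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})$$
$$b_t = \beta (\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1}$$
$$s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m}$$
$$\hat{y}_{t+h|t} = \ell_t + h b_t + s_{t+h-m(k+1)}, \quad k = \lfloor (h-1)/m \rfloor$$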
The function ExponentialSmoothing from the library statsmodels in Python was used to build the HW model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It has been assumed that the future values of the time series depend on exponentially weighted averages of its past values, through the level, trend, and seasonal components.
The dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff:]
# Model Fitting
hw_model = ExponentialSmoothing(y_train, trend='additive', seasonal='additive', seasonal_periods=4).fit()
hw_model.summary()
| Dep. Variable: | net_sales | No. Observations: | 32 |
|---|---|---|---|
| Model: | ExponentialSmoothing | SSE | 554637676.280 |
| Optimized: | True | AIC | 549.379 |
| Trend: | Additive | BIC | 561.105 |
| Seasonal: | Additive | AICC | 559.855 |
| Seasonal Periods: | 4 | Date: | Thu, 12 Sep 2024 |
| Box-Cox: | False | Time: | 09:52:10 |
| Box-Cox Coeff.: | None |
| coeff | code | optimized | |
|---|---|---|---|
| smoothing_level | 0.6767857 | alpha | True |
| smoothing_trend | 0.0001 | beta | True |
| smoothing_seasonal | 0.1010045 | gamma | True |
| initial_level | 1.0762e+05 | l.0 | True |
| initial_trend | 3002.1621 | b.0 | True |
| initial_seasons.0 | -5954.8672 | s.0 | True |
| initial_seasons.1 | -6667.7734 | s.1 | True |
| initial_seasons.2 | -5347.6172 | s.2 | True |
| initial_seasons.3 | 17970.258 | s.3 | True |
y_pred_hw_model = hw_model.forecast(len(y_test))
y_pred_hw_model
2022-01-01    187096.303296
2022-04-01    189832.517330
2022-07-01    193853.235303
2022-10-01    221905.834217
2023-01-01    199100.035656
2023-04-01    201836.249690
2023-07-01    205856.967663
2023-10-01    233909.566577
Freq: QS-OCT, dtype: float64
The predictions were plotted against the historical net sales data to visually assess the performance of the HW model.
plot_predictions(y_pred_hw_model, 'HW Model')
In view of the plot above, the HW model performed notably better than the SES model, as it was able to capture both the trend and seasonality of the WALMEX net sales series.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
hw_rmse, hw_mae, hw_r2 = calculate_scores(pd.DataFrame(y_pred_hw_model))
net_sales RMSE: 9508.312 MAE: 7645.737 Coefficient of Determination: 0.791
# Adding scores to lists
rmse_list.append(hw_rmse[0])
mae_list.append(hw_mae[0])
r2_list.append(hw_r2[0])
A Univariate Time Series model using Prophet based on the WALMEX net sales was built.
Prophet is an open source package built and maintained by Meta for forecasting at scale. It was released in 2017 (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023), and it is able to fit nonlinear trends and multiple seasonalities (Peixeiro, 2022).
Prophet implements a general additive model where a time series $y(t)$ is modeled as the linear combination of a trend $g(t)$, a seasonal component $s(t)$, holiday effects $h(t)$, and an error term $\epsilon_{t}$ (Peixeiro, 2022):
$$y(t) = g(t) + s(t) + h(t) + \epsilon_{t}$$
Even though the model does not take into account any autoregressive process, it has the advantage of being very flexible, as "it can accommodate multiple seasonal periods and changing trends [and] it is robust to outliers and missing data" (Peixeiro, 2022).
The function Prophet from the library fbprophet in Python was used to build the model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It has been assumed that the net sales series has a non-linear trend, at least one seasonal component, holiday effects, and that the error terms are normally distributed.
Firstly, a dataframe with the training set was created with two columns, a date column named ds and a value column named y, as per requirements of the Prophet algorithm.
y_df = pd.DataFrame(y).rename(columns={'net_sales':'y'}).rename_axis('ds').reset_index()
y_df.head()
| ds | y | |
|---|---|---|
| 0 | 2014-01-01 | 101405 |
| 1 | 2014-04-01 | 103300 |
| 2 | 2014-07-01 | 104367 |
| 3 | 2014-10-01 | 128586 |
| 4 | 2015-01-01 | 110875 |
Then, the dataset was split into a training and a testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
time_series = y_df
cutoff = int(0.8*(len(time_series)))
y_train = time_series[:cutoff]
y_test = time_series[cutoff:]
The model built with Prophet was optimized by using cross-validation and hyperparameter tuning.
Specifically, the hyperparameters changepoint_prior_scale and seasonality_prior_scale were tuned.
def prophet_model(series, changepoint_prior_scale, seasonality_prior_scale, metric):
    """
    Fits and optimizes a univariate time series model with Prophet, given a set of changepoint_prior_scale and
    seasonality_prior_scale values.
    Parameters:
    series (pandas.DataFrame): Time series data in two columns: ds for dates in a date datatype, and y for the series.
    changepoint_prior_scale (list): Values for the changepoint_prior_scale hyperparameter in Prophet.
    seasonality_prior_scale (list): Values for the seasonality_prior_scale hyperparameter in Prophet.
    metric (str): Selected performance metric for optimization: one of 'mse', 'rmse', 'mae', 'mdape', or 'coverage'.
    Returns:
    m (prophet.Prophet): A Prophet model object optimized according to the combination of tested hyperparameters,
    by using the indicated metric.
    """
    # Obtaining the combinations of hyperparameters
    params = list(product(changepoint_prior_scale, seasonality_prior_scale))
    # Creating an empty list to store performance results
    metric_results = []
    # Defining cutoff dates
    start_cutoff_percentage = 0.5  # 50% of the data will be used for fitting the model
    start_cutoff_index = int(round(len(series) * start_cutoff_percentage, 0))
    start_cutoff = series.iloc[start_cutoff_index].values[0]
    end_cutoff = series.iloc[-4].values[0]  # The fourth-to-last value is taken as the series is reported on a quarterly basis
    cutoffs = pd.date_range(start=start_cutoff, end=end_cutoff, freq='12M')
    # Fitting models
    for param in params:
        m = Prophet(changepoint_prior_scale=param[0], seasonality_prior_scale=param[1])
        m.add_country_holidays(country_name='MX')
        m.fit(series)
        df_cv = cross_validation(model=m, horizon='365 days', cutoffs=cutoffs)
        df_p = performance_metrics(df_cv, rolling_window=1)
        metric_results.append(df_p[metric].values[0])
    # Converting the list to a dataframe
    results = pd.DataFrame({'Hyperparameters': params,
                            'Metric': metric_results})
    # Storing results from the best model
    best_params = params[np.argmin(metric_results)]
    # Printing results
    print(f'\nThe best model hyperparameters are changepoint_prior_scale = {best_params[0]}, and seasonality_prior_scale = {best_params[1]}.\n')
    print(results)
    # Fitting the best model again
    m = Prophet(changepoint_prior_scale=best_params[0], seasonality_prior_scale=best_params[1])
    m.add_country_holidays(country_name='MX')
    m.fit(series)
    return m
# Defining hyperparameters values
changepoint_prior_scale = [0.001, 0.01, 0.1, 0.5]
seasonality_prior_scale = [0.01, 0.1, 1.0, 10.0]
# Fitting model
prophet_model = prophet_model(y_train, changepoint_prior_scale, seasonality_prior_scale, 'rmse')
The best model hyperparameters are changepoint_prior_scale = 0.01 and seasonality_prior_scale = 0.1.

| | Hyperparameters | Metric |
|---|---|---|
| 0 | (0.001, 0.01) | 5984.533579 |
| 1 | (0.001, 0.1) | 5901.415336 |
| 2 | (0.001, 1.0) | 5906.996465 |
| 3 | (0.001, 10.0) | 6122.661240 |
| 4 | (0.01, 0.01) | 5992.136908 |
| 5 | (0.01, 0.1) | 5867.370102 |
| 6 | (0.01, 1.0) | 5884.870818 |
| 7 | (0.01, 10.0) | 6142.011762 |
| 8 | (0.1, 0.01) | 9667.252523 |
| 9 | (0.1, 0.1) | 9482.562980 |
| 10 | (0.1, 1.0) | 9295.895359 |
| 11 | (0.1, 10.0) | 9485.383653 |
| 12 | (0.5, 0.01) | 11788.038040 |
| 13 | (0.5, 0.1) | 10063.604415 |
| 14 | (0.5, 1.0) | 6934.052458 |
| 15 | (0.5, 10.0) | 8676.735767 |
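For reference, the $16$ rows in the table above correspond to a full Cartesian grid over the two priors. A minimal sketch of how such a grid can be built (the cross-validation loop that scores each combination is assumed to follow the standard Prophet tuning recipe and is omitted here):

```python
from itertools import product

# Candidate values for the two Prophet priors being tuned (taken from the table above)
changepoint_prior_scales = [0.001, 0.01, 0.1, 0.5]
seasonality_prior_scales = [0.01, 0.1, 1.0, 10.0]

# The Cartesian product yields the 16 hyperparameter combinations scored above
param_grid = list(product(changepoint_prior_scales, seasonality_prior_scales))
print(len(param_grid))   # 16
print(param_grid[5])     # (0.01, 0.1) -- the combination with the lowest metric
```

Each combination would then be passed to a Prophet model and scored via rolling-origin cross-validation, keeping the pair with the lowest error metric.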
# Creating a DataFrame to hold the predictions from Prophet
future = y_df.drop(columns=['y'])
# Generating the forecast using the predict method
forecast = prophet_model.predict(future)
forecast.tail(len(y_test))
| ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | Benito Juárez's birthday | Benito Juárez's birthday_lower | Benito Juárez's birthday_upper | Change of Federal Government | ... | holidays | holidays_lower | holidays_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 2022-01-01 | 192027.687645 | 180255.936895 | 188778.534953 | 192027.686952 | 192027.688366 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 244.376869 | 244.376869 | 244.376869 | -7828.252084 | -7828.252084 | -7828.252084 | 0.0 | 0.0 | 0.0 | 184443.812430 |
| 33 | 2022-04-01 | 194618.874577 | 182512.447980 | 191017.965417 | 194618.872328 | 194618.876678 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | -7860.926269 | -7860.926269 | -7860.926269 | 0.0 | 0.0 | 0.0 | 186757.948308 |
| 34 | 2022-07-01 | 197238.852476 | 186383.625295 | 194896.028790 | 197238.848342 | 197238.856577 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | -6513.078378 | -6513.078378 | -6513.078378 | 0.0 | 0.0 | 0.0 | 190725.774097 |
| 35 | 2022-10-01 | 199887.621340 | 215647.594199 | 223629.817263 | 199887.614879 | 199887.627719 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 19668.209751 | 19668.209751 | 19668.209751 | 0.0 | 0.0 | 0.0 | 219555.831091 |
| 36 | 2023-01-01 | 202536.390204 | 192752.981089 | 200595.402548 | 202536.380740 | 202536.399455 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 244.376869 | 244.376869 | 244.376869 | -6233.331246 | -6233.331246 | -6233.331246 | 0.0 | 0.0 | 0.0 | 196547.435827 |
| 37 | 2023-04-01 | 205127.577136 | 192300.026557 | 200730.678387 | 205127.564601 | 205127.589638 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | -8518.680619 | -8518.680619 | -8518.680619 | 0.0 | 0.0 | 0.0 | 196608.896518 |
| 38 | 2023-07-01 | 207747.555035 | 198434.753990 | 206212.384308 | 207747.539590 | 207747.571169 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | -5437.311670 | -5437.311670 | -5437.311670 | 0.0 | 0.0 | 0.0 | 202310.243365 |
| 39 | 2023-10-01 | 210396.323899 | 226362.066637 | 234502.104146 | 210396.305080 | 210396.344076 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 20150.067102 | 20150.067102 | 20150.067102 | 0.0 | 0.0 | 0.0 | 230546.391000 |
8 rows × 43 columns
As can be seen in the table above, the Prophet algorithm provides extensive detail about the fit and its individual components, such as the trend, additive terms, yearly seasonality, and multiplicative terms. The predictions are held in the $yhat$ column.
Moreover, it is possible to assess visually the forecast components of the fit using the plot_components method.
chart_title = 'WALMEX Net Sales Forecast Components from Prophet Model'
prophet_model.plot_components(forecast)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Thus, according to the chart above, the model detected strong seasonality in the WALMEX net sales at the beginning and at the end of the year, which may correspond to several important holidays in Mexico, such as Independence Day and Christmas.
y_pred_prophet_model = (forecast[['ds','yhat']].tail(len(y_test))
.rename(columns={'ds':'date', 'yhat':'net_sales'})
.set_index('date').loc[:,'net_sales'])
y_pred_prophet_model
date
2022-01-01    184443.812430
2022-04-01    186757.948308
2022-07-01    190725.774097
2022-10-01    219555.831091
2023-01-01    196547.435827
2023-04-01    196608.896518
2023-07-01    202310.243365
2023-10-01    230546.391000
Name: net_sales, dtype: float64
The predictions were plotted against the historical net sales data to visually assess the performance of the Prophet model.
plot_predictions(y_pred_prophet_model, 'Prophet Model')
In view of the plot above, the Prophet model performed well as it was able to capture both the trend and seasonality of the WALMEX net sales series.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
prophet_rmse, prophet_mae, prophet_r2 = calculate_scores(pd.DataFrame(y_pred_prophet_model))
net_sales
RMSE: 12323.068
MAE: 10710.708
Coefficient of Determination: 0.649
# Adding scores to lists
rmse_list.append(prophet_rmse[0])
mae_list.append(prophet_mae[0])
r2_list.append(prophet_r2[0])
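For reference, the three scores reported by the calculate_scores helper (defined earlier in this notebook) can be computed with plain NumPy; the values below are illustrative only, not WALMEX data:

```python
import numpy as np

# Illustrative actual and predicted values (not WALMEX data)
actuals = np.array([100.0, 110.0, 120.0, 130.0])
preds = np.array([102.0, 108.0, 123.0, 128.0])

rmse = np.sqrt(np.mean((actuals - preds) ** 2))   # root mean squared error
mae = np.mean(np.abs(actuals - preds))            # mean absolute error
ss_res = np.sum((actuals - preds) ** 2)           # residual sum of squares
ss_tot = np.sum((actuals - actuals.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                          # coefficient of determination

print(rmse, mae, r2)
```

RMSE penalizes large errors more heavily than MAE, while $r^{2}$ measures the fraction of the variance in the actuals explained by the predictions, which is why the three are reported together throughout this notebook.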
As shown above, the original dataset comprised $40$ historical data points for $7$ features (economic indicators) and $1$ response variable (net sales of WALMEX in millions of MXN). However, based on the results from Johansen's cointegration test, only the units, S&P 500, and WALMEX series shared the same underlying stochastic trend, making them the only candidates for a multivariate time series model.
A first model for multivariate time series prediction was built using a Vector Autoregression (VAR) model to forecast the $3$ selected features. A VAR model was selected because it can forecast multiple features in a straightforward manner using the statsmodels library in Python (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
The VAR($p$) model describes the relationship among two or more time series and the impact that their past values have on each other (Peixeiro, 2022). In this sense, the VAR($p$) model is a generalization of the AR model for multiple time series, and the order $p$ determines how many lagged values impact the present value of the different time series (Peixeiro, 2022).
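In matrix form, the VAR(1) case of this model for the three selected series can be sketched as follows (standard notation; $\mathbf{C}$ is a vector of constants, $\Phi_{1}$ is a $3 \times 3$ coefficient matrix, and $\boldsymbol{\epsilon}_{t}$ is a vector of white-noise error terms):

$$\mathbf{y}_{t} = \mathbf{C} + \Phi_{1}\,\mathbf{y}_{t-1} + \boldsymbol{\epsilon}_{t}$$

where $\mathbf{y}_{t}$ holds the values of the three series at time $t$, so the present value of each series is a linear combination of the past values of all the series.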
Several orders of $p$ were tested, and the best performing model was selected according to the Akaike Information Criterion (AIC). The Granger causality test was then applied to determine whether past values of one time series are statistically significant in forecasting another. Finally, a residual analysis was carried out to assess whether the residuals were close to white noise (Peixeiro, 2022).
The class VAR from the library statsmodels in Python was used to build the VAR model (Kulkarni, Shivananda, Kulkarni, & Krishnan, 2023).
It is assumed that the features are cointegrated, i.e., the different features share an underlying stochastic trend, and that the linear combination of the different features is stationary (Clower, 2020).
Moreover, it was assumed that the current values are linearly dependent on their past values (Peixeiro, 2022); and that the different time series were stationary, they were not a random walk, and seasonality effects were not relevant for modeling.
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
X_model = X[['units', 'sp500', 'walmex']]
X_diff_model = X_diff[['units', 'sp500', 'walmex']]
cutoff = int(0.8*(len(X_diff_model)))
X_train = X_diff_model[:cutoff + 1]
X_test = X_diff_model[cutoff:]
Later, the VAR model was built using the class VAR from the statsmodels library.
def VAR_model(p, series):
"""
Fits VAR models using the VAR class from statsmodels for a given set of lags, and returns the one
with the lowest Akaike Information Criterion (AIC).
Parameters:
p (list): Lag values.
series (pandas.dataframe): Time series data.
Returns:
model (statsmodels.var): VAR model object optimized according to the AIC criterion.
"""
# Creating empty lists to store results
aic_results = []
for lag in p:
VAR_model = VAR(endog=series).fit(maxlags=lag)
aic_results.append(VAR_model.aic)
# Converting lists to dataframes
results = pd.DataFrame({'(p)': p,
'AIC': aic_results
})
# Storing results from the best model
lowest_aic = results.AIC.min()
best_model = results.loc[results['AIC'] == lowest_aic, ['(p)']].values[0][0]
# Printing results
print(f'The best model is (p = {best_model}), with an AIC of {lowest_aic:.02f}.\n')
print(results)
# Fitting best model again
VAR_model = VAR(endog=series).fit(maxlags=best_model)
return VAR_model
p = list(range(1,6))
var_model = VAR_model(p=p, series=X_train)
The best model is (p = 1), with an AIC of 19.86.

   (p)        AIC
0    1  19.864087
1    2  20.150298
2    3  20.270986
3    4  20.398069
4    5  20.817201
var_model.summary()
Summary of Regression Results
==================================
Model: VAR
Method: OLS
Date: Thu, 12, Sep, 2024
Time: 09:58:55
--------------------------------------------------------------------
No. of Equations: 3.00000 BIC: 20.4192
Nobs: 31.0000 HQIC: 20.0450
Log likelihood: -427.855 FPE: 4.25352e+08
AIC: 19.8641 Det(Omega_mle): 2.95549e+08
--------------------------------------------------------------------
Results for equation units
============================================================================
coefficient std. error t-stat prob
----------------------------------------------------------------------------
const 20.022032 12.301704 1.628 0.104
L1.units -0.164855 0.180112 -0.915 0.360
L1.sp500 0.000551 0.076224 0.007 0.994
L1.walmex 5.863524 3.506584 1.672 0.094
============================================================================
Results for equation sp500
============================================================================
coefficient std. error t-stat prob
----------------------------------------------------------------------------
const 63.703334 32.214811 1.977 0.048
L1.units 0.215674 0.471664 0.457 0.647
L1.sp500 0.309229 0.199610 1.549 0.121
L1.walmex -9.867439 9.182789 -1.075 0.283
============================================================================
Results for equation walmex
============================================================================
coefficient std. error t-stat prob
----------------------------------------------------------------------------
const 0.600297 0.662098 0.907 0.365
L1.units 0.006522 0.009694 0.673 0.501
L1.sp500 0.006112 0.004103 1.490 0.136
L1.walmex 0.014669 0.188730 0.078 0.938
============================================================================
Correlation matrix of residuals
units sp500 walmex
units 1.000000 0.129691 -0.082913
sp500 0.129691 1.000000 0.221034
walmex -0.082913 0.221034 1.000000
The Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series (Peixeiro, 2022). This test is unidirectional, so it has to be applied twice for each pair of series.
$$H_{0}: y_{2,t}\text{ does not Granger-cause }y_{1,t}$$
$$H_{1}: y_{2,t}\text{ Granger-causes }y_{1,t}$$
$$\alpha = 0.05$$

def test_granger_causality(df, lag, alpha):
"""
Performs the Granger causality test for each pair of stationary time series, in both directions y2 --> y1 and y1 --> y2,
using the grangercausalitytests method from statsmodels.
Parameters:
df (pandas.dataframe): Stationary time series data.
lag (int): Lag for whose value the test is computed.
alpha (float): Desired significance level for the hypothesis test.
Returns:
summary (pandas.dataframe): Summary results from the Granger causality tests.
"""
# Creating empty lists
y2 = []
y1 = []
interpretation = []
ssr_ftest_pvalues = []
ssr_chi2test_pvalues = []
lrtest_pvalues = []
params_ftest_pvalues = []
gc = []
# Iterating over series bidirectionally
arrays = list(combinations_with_replacement(df.columns, 2))
for array in arrays:
if array[1] == array[0]:
continue
# y2 --> y1
print(f'{array[1]} --> {array[0]}')
result = grangercausalitytests(df[[array[0],array[1]]], maxlag=[lag])
ssr_ftest_pvalue = result[lag][0]['ssr_ftest'][1]
ssr_chi2test_pvalue = result[lag][0]['ssr_chi2test'][1]
lrtest_pvalue = result[lag][0]['lrtest'][1]
params_ftest_pvalue = result[lag][0]['params_ftest'][1]
y2.append(array[1])
y1.append(array[0])
ssr_ftest_pvalues.append(round(ssr_ftest_pvalue,4))
ssr_chi2test_pvalues.append(round(ssr_chi2test_pvalue, 4))
lrtest_pvalues.append(round(lrtest_pvalue, 4))
params_ftest_pvalues.append(round(params_ftest_pvalue, 4))
# Test interpretation
if ssr_ftest_pvalue <= alpha and ssr_chi2test_pvalue <= alpha and lrtest_pvalue <= alpha and params_ftest_pvalue <= alpha:
result = f'{array[1]} Granger-causes {array[0]}.'
print(f'\n{result}\n\n')
interpretation.append(result)
gc.append('True')
elif ssr_ftest_pvalue > alpha and ssr_chi2test_pvalue > alpha and lrtest_pvalue > alpha and params_ftest_pvalue > alpha:
result = f'{array[1]} does not Granger-causes {array[0]}.'
print(f'\n{result}\n\n')
interpretation.append(result)
gc.append('False')
else:
result = f'Inconsistent results among the tests.'
print(f'\n{result}\n\n')
interpretation.append(result)
gc.append('Uncertain')
# y1 --> y2
print(f'{array[0]} --> {array[1]}')
result = grangercausalitytests(df[[array[1],array[0]]], maxlag=[lag])
ssr_ftest_pvalue = result[lag][0]['ssr_ftest'][1]
ssr_chi2test_pvalue = result[lag][0]['ssr_chi2test'][1]
lrtest_pvalue = result[lag][0]['lrtest'][1]
params_ftest_pvalue = result[lag][0]['params_ftest'][1]
y2.append(array[0])
y1.append(array[1])
ssr_ftest_pvalues.append(round(ssr_ftest_pvalue, 4))
ssr_chi2test_pvalues.append(round(ssr_chi2test_pvalue, 4))
lrtest_pvalues.append(round(lrtest_pvalue, 4))
params_ftest_pvalues.append(round(params_ftest_pvalue, 4))
# Test interpretation
if ssr_ftest_pvalue <= alpha and ssr_chi2test_pvalue <= alpha and lrtest_pvalue <= alpha and params_ftest_pvalue <= alpha:
result = f'{array[0]} Granger-causes {array[1]}.'
print(f'\n{result}\n\n')
interpretation.append(result)
gc.append('True')
elif ssr_ftest_pvalue > alpha and ssr_chi2test_pvalue > alpha and lrtest_pvalue > alpha and params_ftest_pvalue > alpha:
result = f'{array[0]} does not Granger-causes {array[1]}.'
print(f'\n{result}\n\n')
interpretation.append(result)
gc.append('False')
else:
result = f'Inconsistent results among the tests.'
print(f'\n{result}\n\n')
interpretation.append(result)
gc.append('Uncertain')
# Building dataframe with summary results
results_dict = {'y2':y2, 'y1':y1, 'Granger Causality':gc,
'Test Interpretation':interpretation,
'SSR F-test p-values': ssr_ftest_pvalues,
'SSR Chi2-test p-values':ssr_chi2test_pvalues,
'LR-test p-values': lrtest_pvalues,
'Params F-test p-values': params_ftest_pvalues
}
summary = pd.DataFrame(results_dict)
return summary
gc_summary = test_granger_causality(df=X_diff_model, lag=1, alpha=0.05)
gc_summary
sp500 --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.3416  , p=0.5589  , df=1
likelihood ratio test: chi2=0.3401  , p=0.5598  , df=1
parameter F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1

sp500 does not Granger-causes units.

units --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.5346  , p=0.4647  , df=1
likelihood ratio test: chi2=0.5309  , p=0.4662  , df=1
parameter F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1

units does not Granger-causes sp500.

walmex --> units

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.5277  , p=0.2165  , df=1
likelihood ratio test: chi2=1.4978  , p=0.2210  , df=1
parameter F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1

walmex does not Granger-causes units.

units --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.7878  , p=0.3748  , df=1
likelihood ratio test: chi2=0.7797  , p=0.3772  , df=1
parameter F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1

units does not Granger-causes walmex.

walmex --> sp500

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.0988  , p=0.2945  , df=1
likelihood ratio test: chi2=1.0832  , p=0.2980  , df=1
parameter F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1

walmex does not Granger-causes sp500.

sp500 --> walmex

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.7710  , p=0.1833  , df=1
likelihood ratio test: chi2=1.7310  , p=0.1883  , df=1
parameter F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1

sp500 does not Granger-causes walmex.
| y2 | y1 | Granger Causality | Test Interpretation | SSR F-test p-values | SSR Chi2-test p-values | LR-test p-values | Params F-test p-values | |
|---|---|---|---|---|---|---|---|---|
| 0 | sp500 | units | False | sp500 does not Granger-causes units. | 0.5784 | 0.5589 | 0.5598 | 0.5784 |
| 1 | units | sp500 | False | units does not Granger-causes sp500. | 0.4875 | 0.4647 | 0.4662 | 0.4875 |
| 2 | walmex | units | False | walmex does not Granger-causes units. | 0.2435 | 0.2165 | 0.2210 | 0.2435 |
| 3 | units | walmex | False | units does not Granger-causes walmex. | 0.4001 | 0.3748 | 0.3772 | 0.4001 |
| 4 | walmex | sp500 | False | walmex does not Granger-causes sp500. | 0.3213 | 0.2945 | 0.2980 | 0.3213 |
| 5 | sp500 | walmex | False | sp500 does not Granger-causes walmex. | 0.2099 | 0.1833 | 0.1883 | 0.2099 |
Although the results from the Granger causality tests suggested that the null hypothesis (that the series do not Granger-cause each other) could not be rejected, thus rendering the VAR model theoretically invalid, it was decided to proceed with the model.
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness; in other words, whether the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022): first, a qualitative inspection of the Q-Q plots of the residuals to check that they are approximately normal; and second, the Ljung-Box test to check that they are uncorrelated.
# Storing residuals
sc_X = StandardScaler()
residuals = sc_X.fit_transform(var_model.resid)
base_chart_title = 'Q-Q plot for standardized residuals for '
cols = X_model.columns.values
# Loop
pos = 1
for i in range(0,len(cols)):
# Plots
plt.figure(figsize=(8,15))
ax = plt.subplot(len(cols), 1, pos)
sm.qqplot(residuals[:,i], line ='45', ax=ax)
chart_title = base_chart_title + cols[i] + ' series from VAR model'
plt.title(chart_title)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
pos += 1
plt.show()
From the plots above, it is clear that the standardized residuals for the units and S&P 500 series did not follow a normal distribution, which means that the VAR model was not able to capture some of the information in the data. Despite this result, the analysis continued below.
Then, the residuals were tested for the absence of autocorrelation with the Ljung-Box test (Peixeiro, 2022), using a $95\%$ confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$

# Ljung-Box test of the residuals on 10 lags
cols = X_model.columns.values
for i in range(0, len(cols)):
print(cols[i])
print(acorr_ljungbox(residuals[:,i], np.arange(1, 11)))
print("\n")
units
lb_stat lb_pvalue
1 0.031080 0.860061
2 1.095988 0.578108
3 1.757447 0.624238
4 3.141998 0.534350
5 4.402226 0.493071
6 4.764572 0.574344
7 5.555494 0.592499
8 7.817132 0.451534
9 8.552430 0.479570
10 8.901256 0.541500
sp500
lb_stat lb_pvalue
1 0.201141 0.653801
2 0.878931 0.644381
3 0.918750 0.820901
4 2.046825 0.727147
5 2.403074 0.791016
6 3.670298 0.721191
7 3.944434 0.786155
8 5.081352 0.748847
9 5.760660 0.763613
10 6.624347 0.760369
walmex
lb_stat lb_pvalue
1 0.296180 0.586287
2 1.978856 0.371789
3 2.822988 0.419730
4 3.989120 0.407480
5 4.279314 0.509941
6 6.838359 0.336055
7 7.050704 0.423620
8 9.247222 0.321872
9 9.670043 0.377851
10 9.824165 0.456053
For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. The null hypothesis therefore cannot be rejected, meaning that no autocorrelation was found in the residuals of the VAR model. Thus, the residuals are independently distributed and the model could be used for forecasting.
The model was then used to forecast the features in the dataset, using the length of the testing set.
# Forecasting
X_pred_var_model = var_model.forecast(var_model.endog, steps=len(X_test))
X_pred_var_model
array([[25.9666079 , 10.59689735, -0.14430718],
[14.90100107, 74.00447175, 0.83230058],
[22.48650828, 81.58875412, 1.16197562],
[23.17323618, 82.31698421, 1.26263712],
[23.65065798, 81.69701243, 1.2730433 ],
[23.63262811, 81.50558432, 1.27252074],
[23.63243097, 81.44765688, 1.27122556],
[23.62483721, 81.44248167, 1.27085125]])
# Bringing predictions to original scale
for i in range(0, len(X_model.columns)):
X_pred_var_model[:,i] = X_model[:cutoff+1].values[-1,i] + X_pred_var_model[:,i].cumsum()
# Converting array of predictions into a dataframe
features_names = X_model.columns
index_labels = X_diff_model[cutoff:].index
X_pred_var_model = pd.DataFrame(data=X_pred_var_model, index=index_labels, columns=features_names)
X_pred_var_model
| units | sp500 | walmex | |
|---|---|---|---|
| date | |||
| 2022-01-01 | 3645.966608 | 4610.943987 | 72.843919 |
| 2022-04-01 | 3660.867609 | 4684.948458 | 73.676219 |
| 2022-07-01 | 3683.354117 | 4766.537212 | 74.838195 |
| 2022-10-01 | 3706.527353 | 4848.854197 | 76.100832 |
| 2023-01-01 | 3730.178011 | 4930.551209 | 77.373875 |
| 2023-04-01 | 3753.810640 | 5012.056793 | 78.646396 |
| 2023-07-01 | 3777.443070 | 5093.504450 | 79.917621 |
| 2023-10-01 | 3801.067908 | 5174.946932 | 81.188473 |
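Because the VAR model was fitted on the differenced series, its raw forecasts are differences, which the code above converts back to levels by adding the cumulative sum of the forecasts to the last value of the training set. A minimal sketch of this inversion, with made-up numbers rather than the actual series:

```python
import numpy as np

# A hypothetical level series and its first differences
levels = np.array([100.0, 103.0, 101.0, 106.0, 110.0])
diffs = np.diff(levels)            # the scale on which the model forecasts

# Reverting the differencing: last known level + cumulative sum of the differences
reconstructed = levels[0] + diffs.cumsum()
print(reconstructed)  # [103. 101. 106. 110.] -- the original levels are recovered
```

The same logic applies column by column in the loop above, with X_model[:cutoff+1].values[-1,i] playing the role of the last known level.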
The predictions from the VAR model were plotted against the time series to visualize their accuracy.
def plot_multiple_predictions(train, test, predictions, plot_title, plot_size=(8,7)):
"""
Plots the training, testing and predictions sets for multiple time series in a single visual.
Parameters:
train (pandas.dataframe): Training set for the multiple time series in their original scale.
test (pandas.dataframe): Testing set for the multiple time series in their original scale.
predictions (pandas.dataframe): Prediction set for the multiple time series in their original scale.
plot_title (str): Title for saving the plot
plot_size (tuple of integers): Tuple for the figure size. (8,7) by default.
Returns:
None
"""
# Colors
time_series_color=sns.color_palette('Blues_r')[0]
test_color = 'green'
pred_color = '#C70039'
# Data
cols = train.columns.values
# Fig overall size
plt.figure(figsize=plot_size)
# Loop
i = 1
for col in cols:
# Plots
plt.subplot(len(cols), 1, i)
ax = sns.lineplot(train[col], color=time_series_color, lw=1, zorder=2)
sns.lineplot(test[col], color=test_color, lw=1, linestyle='dashed', zorder=1)
sns.lineplot(predictions[col], color=pred_color, lw=2, zorder=3)
# Adding shadow to predictions
first_pred_x = predictions.index[0]
last_pred_x = predictions.index[-1]
ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)
# Removing grid
plt.grid(False)
# Title
plt.title(col, y=0.8, loc='left')
# Removing labels and ticks
plt.ylabel('')
plt.xlabel('')
#plt.yticks([])
#plt.xticks([])
i += 1
# Adding a legend
legend_x = 1.12
legend_y = 1.15 * len(cols)
legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
Line2D([0], [0], color=test_color, lw=2, linestyle='dashed'),
Line2D([0], [0], color=pred_color, lw=2)]
plt.legend(legend_lines,
['Training', 'Testing', 'Predictions'],
loc='upper center',
bbox_to_anchor=(legend_x, legend_y),
facecolor='#ececec',
frameon = True)
# Adjusting ticks in plot
# max_xticks = 12
# xloc = plt.MaxNLocator(max_xticks)
# ax.xaxis.set_major_locator(xloc)
# plt.xticks(rotation=45)
# Adding a x label
plt.xlabel('Date')
title = "Predictions from " + plot_title + " vs. Historical Time Series"
plt.savefig(f'Images/fig_{title.lower().replace(" ", "_").replace(".", "")}.png', bbox_inches = 'tight')
plt.show()
plot_multiple_predictions(train=X_model[:cutoff + 2],
test=X_model[cutoff + 1:],
predictions=X_pred_var_model,
plot_title='VAR Model')
def plot_individual_predictions(train, test, predictions, plot_size=(8,5)):
"""
Plots the time series along with its correspondent predictions from a dataframe in an individual chart each.
Train, test and prediction parameters must have the same columns.
Parameters:
train (pandas.dataframe): Pandas dataframe with the training portion of the time series.
test (pandas.dataframe): Pandas dataframe with the testing portion of the time series.
predictions (pandas.dataframe): Pandas dataframe with the predictions for the time series.
plot_size (tuple of integers): Tuple for the figure size. (8,5) by default.
Returns:
None
"""
columns = list(train.columns)
train_color=sns.color_palette('Blues_r')[0]
test_color='green'
pred_color = '#C70039'
for col in columns:
# Drawing plots
fig, ax = plt.subplots(figsize=plot_size)
sns.lineplot(train[col], ax=ax, color=train_color, lw=1)
sns.lineplot(test[col], ax=ax, color=test_color, lw=1, linestyle='dashed')
sns.lineplot(predictions[col], ax=ax, color=pred_color, lw=2)
# Adding shadow to predictions
first_pred_x = test.index[0]
last_pred_x = test.index[-1]
ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)
# Adding labels to plot
label = str(col).title().replace('_',' ')
chart_title = f'Predicted {label} Over Time'
plt.title(chart_title)
plt.ylabel(f'{label}')
plt.xlabel('Time')
# Adding legend to plot
legend_lines = [Line2D([0], [0], color=train_color, lw=2),
Line2D([0], [0], color=test_color, lw=2, linestyle='dashed'),
Line2D([0], [0], color=pred_color, lw=2)]
plt.legend(legend_lines, ['Training', 'Testing', 'Predictions'], loc='upper left', facecolor='white', frameon=True)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
plot_individual_predictions(train=X_model[:cutoff + 2],
test=X_model[cutoff + 1:],
predictions=X_pred_var_model)
In view of the plots above, it can be seen that the VAR model was not able to yield accurate predictions. The only economic indicator whose predictions were reasonably accurate was units. However, it is important to bear in mind that the COVID-19 pandemic disrupted historical trends not only in Mexico but across the entire world; so, to be fair, it was very difficult to expect the VAR model to provide accurate predictions for the years 2021-2023 using the data from 2014-2020 for training.
The RMSE, MAE, and $\bf{r^{2}}$ score were calculated for each feature as follows:
var_rmse, var_mae, var_r2 = calculate_scores(predictions=X_pred_var_model, actuals=X_model[cutoff + 1:])
units
RMSE: 41.684
MAE: 30.270
Coefficient of Determination: 0.750

sp500
RMSE: 741.698
MAE: 699.704
Coefficient of Determination: -9.352

walmex
RMSE: 7.841
MAE: 6.134
Coefficient of Determination: -7.682
# Adding scores to lists
X_rmse_list.append(var_rmse)
X_mae_list.append(var_mae)
X_r2_list.append(var_r2)
As shown above, the original dataset comprised $40$ historical data points for $7$ features (economic indicators) and $1$ response variable (net sales of WALMEX in millions of MXN). However, based on the results from Johansen's cointegration test, only the units, S&P 500, and WALMEX series shared the same underlying stochastic trend, making them the only candidates for a multivariate time series model.
In this sense, a second model for multivariate time series prediction was built using a Vector Autoregressive Moving Average (VARMA) model to forecast the $3$ selected features. A VARMA model was selected as a generalization of the VAR model that includes a moving average process (Peixeiro, 2022).
The VARMA($p$, $q$) model describes the relationship among two or more time series and the impact that their past values and past error terms have on each other (Peixeiro, 2022). Similar to the VAR model, the VARMA($p$, $q$) model is a generalization of the ARMA model for multiple time series, where the order $p$ determines how many lagged values impact the present value of the different time series, and the order $q$ determines the number of past error terms that affect their present values.
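In matrix form, the VARMA(1, 1) case of this model can be sketched as follows (standard notation; $\mathbf{C}$ is a vector of constants, $\Phi_{1}$ and $\Theta_{1}$ are $3 \times 3$ coefficient matrices, and $\boldsymbol{\epsilon}_{t}$ is a vector of white-noise error terms):

$$\mathbf{y}_{t} = \mathbf{C} + \Phi_{1}\,\mathbf{y}_{t-1} + \Theta_{1}\,\boldsymbol{\epsilon}_{t-1} + \boldsymbol{\epsilon}_{t}$$

where $\Theta_{1}$ weights the past error terms, so the present value of each series depends on both the past values and the past errors of all the series.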
Several orders of $p$ and $q$ were tested, and the best performing model was selected according to the Akaike Information Criterion (AIC). The Granger causality test was then applied to determine whether past values of one time series are statistically significant in forecasting another. Finally, a residual analysis was carried out to assess whether the residuals were normally and independently distributed (Peixeiro, 2022).
The class VARMAX from the library statsmodels in Python was used to build the VARMA model (Peixeiro, 2022).
It is assumed that the features are cointegrated, i.e., the different features share an underlying stochastic trend, and that the linear combination of the different features is stationary (Clower, 2020).
Moreover, it was assumed that the current values are linearly dependent on their past values and past error terms (Peixeiro, 2022); and that the different time series were stationary, they were not a random walk, and seasonality effects were not relevant for modeling.
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
selected_features = ['units', 'sp500', 'walmex']
X_model = X[selected_features]
X_diff_model = X_diff[selected_features]
cutoff = int(0.8*(len(X_diff_model)))
X_train = X_diff_model[:cutoff + 1]
X_test = X_diff_model[cutoff:]
Later, the VARMA model was built using the class VARMAX from the statsmodels library.
def VARMA_model(p, q, series):
"""
Fits a series of VARMA models based on the combinations of p and q provided and
returns the best performing model according to the Akaike Information Criterion (AIC).
Parameters:
p (list): Orders for the autoregressive process of the model.
q (list): Orders for the moving average process of the model.
series (pandas.dataframe): Data for the time series for fitting the model.
Returns:
model (statsmodels.varmax): A VARMA model object fitted according to the combination of p and q that minimizes
the Akaike Information Criterion (AIC).
"""
# Obtaining the combinations of p and q
order_list = list(product(p, q))
# Creating empty lists to store results
order_results = []
aic_results = []
# Fitting models
for order in order_list:
model = VARMAX(endog=series, order=(order[0], order[1])).fit()
order_results.append(order)
aic_results.append(model.aic)
# Converting lists to dataframes
results = pd.DataFrame({'(p,q)': order_results,
'AIC': aic_results
})
# Storing results from the best model
lowest_aic = results.AIC.min()
best_model = results.loc[results['AIC'] == lowest_aic, ['(p,q)']].values[0][0]
# Printing results
print(f'The best model is (p = {best_model[0]}, q = {best_model[1]}), with an AIC of {lowest_aic:.02f}.\n')
print(results)
# Fitting best model again
model = VARMAX(endog=series, order=(best_model[0], best_model[1])).fit()
return model
p_list = [1,2,3,4]
q_list = [1,2,3,4]
varma_model = VARMA_model(p=p_list, q=q_list, series=X_train)
The best model is (p = 1, q = 1), with an AIC of 929.44.
(p,q) AIC
0 (1, 1) 929.443620
1 (1, 2) 937.121768
2 (1, 3) 949.256778
3 (1, 4) 953.577824
4 (2, 1) 939.206120
5 (2, 2) 954.181567
6 (2, 3) 954.246749
7 (2, 4) 963.524345
8 (3, 1) 945.146871
9 (3, 2) 957.571176
10 (3, 3) 957.117134
11 (3, 4) 964.511917
12 (4, 1) 952.156263
13 (4, 2) 952.746762
14 (4, 3) 967.041897
15 (4, 4) 980.652654
So, the best model according to the AIC corresponds to VARMA($1$, $1$).
The Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series (Peixeiro, 2022). This test is unidirectional, so it has to be applied twice for each pair of series.
$$H_{0}: y_{2,t}\text{ does not Granger-cause }y_{1,t}$$
$$H_{1}: y_{2,t}\text{ Granger-causes }y_{1,t}$$
$$\alpha = 0.05$$
gc_summary = test_granger_causality(df=X_diff_model, lag=1, alpha=0.05)
gc_summary
sp500 --> units
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.3416  , p=0.5589  , df=1
likelihood ratio test: chi2=0.3401  , p=0.5598  , df=1
parameter F test:         F=0.3146  , p=0.5784  , df_denom=35, df_num=1
sp500 does not Granger-cause units.

units --> sp500
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.5346  , p=0.4647  , df=1
likelihood ratio test: chi2=0.5309  , p=0.4662  , df=1
parameter F test:         F=0.4924  , p=0.4875  , df_denom=35, df_num=1
units does not Granger-cause sp500.

walmex --> units
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.5277  , p=0.2165  , df=1
likelihood ratio test: chi2=1.4978  , p=0.2210  , df=1
parameter F test:         F=1.4071  , p=0.2435  , df_denom=35, df_num=1
walmex does not Granger-cause units.

units --> walmex
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=0.7878  , p=0.3748  , df=1
likelihood ratio test: chi2=0.7797  , p=0.3772  , df=1
parameter F test:         F=0.7256  , p=0.4001  , df_denom=35, df_num=1
units does not Granger-cause walmex.

walmex --> sp500
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.0988  , p=0.2945  , df=1
likelihood ratio test: chi2=1.0832  , p=0.2980  , df=1
parameter F test:         F=1.0120  , p=0.3213  , df_denom=35, df_num=1
walmex does not Granger-cause sp500.

sp500 --> walmex
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1
ssr based chi2 test:   chi2=1.7710  , p=0.1833  , df=1
likelihood ratio test: chi2=1.7310  , p=0.1883  , df=1
parameter F test:         F=1.6312  , p=0.2099  , df_denom=35, df_num=1
sp500 does not Granger-cause walmex.
| | y2 | y1 | Granger Causality | Test Interpretation | SSR F-test p-values | SSR Chi2-test p-values | LR-test p-values | Params F-test p-values |
|---|---|---|---|---|---|---|---|---|
| 0 | sp500 | units | False | sp500 does not Granger-cause units. | 0.5784 | 0.5589 | 0.5598 | 0.5784 |
| 1 | units | sp500 | False | units does not Granger-cause sp500. | 0.4875 | 0.4647 | 0.4662 | 0.4875 |
| 2 | walmex | units | False | walmex does not Granger-cause units. | 0.2435 | 0.2165 | 0.2210 | 0.2435 |
| 3 | units | walmex | False | units does not Granger-cause walmex. | 0.4001 | 0.3748 | 0.3772 | 0.4001 |
| 4 | walmex | sp500 | False | walmex does not Granger-cause sp500. | 0.3213 | 0.2945 | 0.2980 | 0.3213 |
| 5 | sp500 | walmex | False | sp500 does not Granger-cause walmex. | 0.2099 | 0.1833 | 0.1883 | 0.2099 |
Although the results from the Granger causality tests indicated that the null hypothesis that the series do not Granger-cause each other could not be rejected, thus weakening the validity of the VARMA model, it was decided to proceed with the model.
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022): a qualitative inspection of the diagnostic plots (standardized residuals, histogram, Q-Q plot, and correlogram), and a quantitative assessment of the residuals' independence with the Ljung-Box test.
# Storing residuals
sc_X = StandardScaler()
residuals = sc_X.fit_transform(varma_model.resid)
# Plotting QQ-Plot for units
chart_title = 'Q-Q plot for standardized residuals for unit series from VARMA model'
varma_model.plot_diagnostics(figsize=(10, 8), variable=0)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the plot above, it seems that the standardized residuals for the units series have no trend and constant variance. The histogram also resembles a normal distribution. Moreover, the Q-Q plot shows a loosely straight line, although some curvature at the extremities can be seen. Finally, the correlogram shows no significant coefficients. Thus, it is possible to conclude that the residuals are somewhat close to white noise.
# Plotting QQ-Plot for SP&500
chart_title = 'Q-Q plot for standardized residuals for SP&500 series from VARMA model'
varma_model.plot_diagnostics(figsize=(10, 8), variable=1)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the plot above, it seems that the standardized residuals for the S&P 500 series have no trend and constant variance. The histogram also fairly resembles a normal distribution. Moreover, the Q-Q plot shows a fairly straight line, although some curvature at the extremities can be seen. Finally, the correlogram shows no significant coefficients. Thus, it is possible to conclude that the residuals are close to white noise.
# Plotting QQ-Plot for WALMEX
chart_title = 'Q-Q plot for standardized residuals for WALMEX series from VARMA model'
varma_model.plot_diagnostics(figsize=(10, 8), variable=2)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
From the plot above, it seems that the standardized residuals for the WALMEX series have no trend and constant variance. The histogram also fairly resembles a normal distribution. Moreover, the Q-Q plot shows a fairly straight line. Finally, the correlogram shows no significant coefficients. Thus, it is possible to conclude that the residuals are close to white noise.
Finally, the residuals were assessed by testing for uncorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
# Ljung-Box test of the residuals on 10 lags
cols = X_model.columns.values
for i in range(0, len(cols)):
    print(cols[i])
    print(acorr_ljungbox(residuals[:, i], np.arange(1, 11)))
    print("\n")
units
lb_stat lb_pvalue
1 0.039994 0.841493
2 1.689027 0.429766
3 2.133470 0.545171
4 3.062334 0.547449
5 4.723216 0.450584
6 5.216527 0.516357
7 6.088058 0.529506
8 9.093956 0.334432
9 9.647166 0.379797
10 10.136544 0.428597
sp500
lb_stat lb_pvalue
1 0.265940 0.606068
2 0.338502 0.844297
3 0.723513 0.867660
4 2.604584 0.626011
5 2.995088 0.700743
6 3.297182 0.770728
7 3.396420 0.846072
8 5.970563 0.650529
9 6.945731 0.642770
10 8.531878 0.577029
walmex
lb_stat lb_pvalue
1 0.016821 0.896808
2 0.530911 0.766857
3 1.258959 0.738901
4 3.589527 0.464397
5 4.365721 0.498048
6 5.651316 0.463360
7 9.178720 0.240078
8 9.653880 0.290164
9 9.935129 0.355762
10 10.193264 0.423704
For each of the lags from 1 to 10, the $p\text{-}values$ were above $0.05$. Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the VARMA model. Therefore, the residuals are independently distributed and the model could be used for forecasting.
The model was then used to forecast the features in the dataset, using the length of the testing set.
# Forecasting
X_pred_varma_model = varma_model.get_prediction(start=0, end=len(X_test) - 1).predicted_mean.values
X_pred_varma_model
array([[ 2.63279862e+01, 9.43190298e+01, 1.44490053e+00],
[ 3.86450775e+01, 5.98082098e+01, 1.23995013e+00],
[ 2.25953637e+01, 8.17870839e+01, 2.24825348e-01],
[ 1.47676179e+00, 1.30564027e+02, 1.87963976e+00],
[ 4.07761480e+01, 8.01085645e+01, 2.82799959e+00],
[ 4.07791124e+01, 3.81162321e+01, 1.19670208e-01],
[ 3.11122408e+01, 3.07986560e+01, -9.63151577e-01],
[ 2.76706732e+01, 1.35214035e+01, 1.65795496e-01]])
# Bringing predictions to original scale
for i in range(0, len(X_model.columns)):
    X_pred_varma_model[:, i] = X_model[:cutoff+1].values[-1, i] + X_pred_varma_model[:, i].cumsum()
# Converting array of predictions into a dataframe
features_names = X_model.columns
index_labels = X_diff_model[cutoff:].index
X_pred_varma_model = pd.DataFrame(data=X_pred_varma_model, index=index_labels, columns=features_names)
X_pred_varma_model
| date | units | sp500 | walmex |
|---|---|---|---|
| 2022-01-01 | 3646.327986 | 4694.666119 | 74.433126 |
| 2022-04-01 | 3684.973064 | 4754.474329 | 75.673076 |
| 2022-07-01 | 3707.568427 | 4836.261413 | 75.897902 |
| 2022-10-01 | 3709.045189 | 4966.825440 | 77.777542 |
| 2023-01-01 | 3749.821337 | 5046.934004 | 80.605541 |
| 2023-04-01 | 3790.600450 | 5085.050236 | 80.725211 |
| 2023-07-01 | 3821.712690 | 5115.848892 | 79.762060 |
| 2023-10-01 | 3849.383364 | 5129.370296 | 79.927855 |
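The inverse-differencing step used above to bring the forecasts back to the original scale can be sanity-checked on a toy series: adding the last observed level to the cumulative sum of the first differences recovers the original levels.

```python
import numpy as np

# Toy check of the inverse-differencing logic: for a series of first
# differences d, last_level + cumsum(d) rebuilds the levels that follow.
series = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
diffs = np.diff(series)                 # first differences: [2, 3, 4, 5]
recovered = series[0] + diffs.cumsum()  # rebuild levels from the first value
assert np.allclose(recovered, series[1:])
```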
The predictions from the VARMA model were plotted against the time series to visualize their accuracy.
plot_multiple_predictions(train=X_model[:cutoff + 2],
test=X_model[cutoff + 1:],
predictions=X_pred_varma_model,
plot_title='VARMA Model')
plot_individual_predictions(train=X_model[:cutoff + 2],
test=X_model[cutoff + 1:],
predictions=X_pred_varma_model)
In view of the plots above, it can be seen that the VARMA model was not able to yield accurate predictions. The only economic indicator whose predictions were reasonably accurate was units. However, it is important to bear in mind that the COVID pandemic disrupted historical trends not only in Mexico but worldwide, so it was very difficult to expect the VARMA model to provide accurate predictions for the years 2021-2023 using data from 2014-2020 for training. Overall, it seems that the performance of the VARMA model was better than that of the VAR model.
The RMSE, MAE, and $\bf{r^{2}}$ score were calculated for each feature as follows:
varma_rmse, varma_mae, varma_r2 = calculate_scores(predictions=X_pred_varma_model, actuals=X_model[cutoff + 1:])
units
RMSE: 29.895
MAE: 26.242
Coefficient of Determination: 0.871

sp500
RMSE: 806.357
MAE: 763.090
Coefficient of Determination: -11.235

walmex
RMSE: 8.352
MAE: 7.041
Coefficient of Determination: -8.852
# Adding scores to lists
X_rmse_list.append(varma_rmse)
X_mae_list.append(varma_mae)
X_r2_list.append(varma_r2)
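The helper `calculate_scores` is defined earlier in the notebook. A plausible sketch of such a per-column scoring function (illustrative only; the actual helper and its name may differ) is:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def scores_per_column(predictions, actuals):
    """Compute RMSE, MAE, and r2 for each column of two aligned DataFrames."""
    rmse, mae, r2 = [], [], []
    for col in actuals.columns:
        rmse.append(np.sqrt(mean_squared_error(actuals[col], predictions[col])))
        mae.append(mean_absolute_error(actuals[col], predictions[col]))
        r2.append(r2_score(actuals[col], predictions[col]))
        print(f'{col}  RMSE: {rmse[-1]:.3f}  MAE: {mae[-1]:.3f}  '
              f'Coefficient of Determination: {r2[-1]:.3f}')
    return rmse, mae, r2
```

Returning the three score lists makes it easy to append them to the running comparison lists, as done above.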
As shown above, the original dataset comprised $40$ historical data points about $7$ features (economic indicators) and $1$ response variable (net sales of WALMEX in millions of MXN). However, based on the results from Johansen's cointegration test, only the series units, S&P 500, and WALMEX shared the same underlying stochastic trend and could thus be used in a multivariate time series model.
In this sense, a third model for multivariate time series prediction was built using a Vector Autoregressive Integrated Moving Average (VARIMA) model for forecasting the $3$ selected features. A VARIMA model was selected as a generalization of the VARMA model that includes an integration process (Peixeiro, 2022) for forecasting non-stationary time series, that is, time series that have a trend and/or a non-constant variance.
The VARIMA($p$, $d$, $q$) model describes the relationship among two or more non-stationary time series and the impact that their past values and past error terms have on each other (Peixeiro, 2022). Similar to the VARMA model, the VARIMA($p$, $d$, $q$) model is a generalization of the ARIMA model for multiple non-stationary time series, where the order $p$ determines how many lagged values impact the present value of the different time series, the order $d$ indicates the number of times the time series have been differenced to become stationary, and the order $q$ determines the number of past error terms that affect the present values.
The function VARIMA from the library Darts in Python was used to build the VARIMA model (Herzen et al., 2022).
As the implementation of the VARIMA model in Darts does not support the prediction of likelihood parameters, the Akaike Information Criterion (AIC) could not be computed for different sets of $p$ and $q$ values. So, the same orders of $p=1$ and $q=1$ obtained during the optimization of the VARMA model were used for VARIMA($1$,$1$,$1$).
Next, the Granger causality test was applied to determine whether past values of one time series are statistically significant in forecasting another. Finally, a residual analysis was carried out to assess whether the residuals were normally and independently distributed (Peixeiro, 2022).
It is assumed that the features are cointegrated, i.e., the different features share an underlying stochastic trend, and that the linear combination of the different features is stationary (Clower, 2020).
Moreover, it was assumed that the current values are linearly dependent on their past values and past error terms (Peixeiro, 2022), the time series are not a random walk, and seasonality effects were not relevant for modeling.
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
selected_features = ['units', 'sp500', 'walmex']
X_model = X[selected_features]
cutoff = int(0.8*(len(X_model)))
X_train = X_model[:cutoff]
X_test = X_model[cutoff:]
Later, the VARIMA model was built using the class VARIMA from the Darts library.
def VARIMA_model(p, d, q, series):
    """
    Fits a VARIMA model with the given orders p, d, and q.
    Parameters:
        p (int): Order of the autoregressive process of the model.
        d (int): Integration order of the model.
        q (int): Order of the moving average process of the model.
        series (pandas.DataFrame): Data for the time series for fitting the model.
    Returns:
        model (darts.models.VARIMA): A VARIMA model object fitted with the given p, d, and q orders.
    """
    # Converting pandas.DataFrame to Darts.TimeSeries
    series = TimeSeries.from_dataframe(series)
    model = VARIMA(p=p, d=d, q=q, trend="n")  # No trend for models with integration
    model.fit(series)
    return model
varima_model = VARIMA_model(p=1, d=1, q=1, series=X_train)
The Granger causality test was applied to determine whether past values of a time series are statistically significant in forecasting another time series (Peixeiro, 2022). This test is unidirectional, so it has to be applied twice for each pair of series.
$$H_{0}: y_{2,t}\text{ does not Granger-cause }y_{1,t}$$
$$H_{1}: y_{2,t}\text{ Granger-causes }y_{1,t}$$
$$\alpha = 0.05$$
gc_summary = test_granger_causality(df=X_model, lag=1, alpha=0.05)
gc_summary
sp500 --> units
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=2.7736  , p=0.1045  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=3.0048  , p=0.0830  , df=1
likelihood ratio test: chi2=2.8946  , p=0.0889  , df=1
parameter F test:         F=2.7736  , p=0.1045  , df_denom=36, df_num=1
sp500 does not Granger-cause units.

units --> sp500
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=2.2995  , p=0.1382  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=2.4911  , p=0.1145  , df=1
likelihood ratio test: chi2=2.4148  , p=0.1202  , df=1
parameter F test:         F=2.2995  , p=0.1382  , df_denom=36, df_num=1
units does not Granger-cause sp500.

walmex --> units
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6690  , p=0.2046  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=1.8081  , p=0.1787  , df=1
likelihood ratio test: chi2=1.7674  , p=0.1837  , df=1
parameter F test:         F=1.6690  , p=0.2046  , df_denom=36, df_num=1
walmex does not Granger-cause units.

units --> walmex
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.6634  , p=0.2054  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=1.8021  , p=0.1795  , df=1
likelihood ratio test: chi2=1.7617  , p=0.1844  , df=1
parameter F test:         F=1.6634  , p=0.2054  , df_denom=36, df_num=1
units does not Granger-cause walmex.

walmex --> sp500
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.8685  , p=0.3576  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=0.9409  , p=0.3320  , df=1
likelihood ratio test: chi2=0.9297  , p=0.3349  , df=1
parameter F test:         F=0.8685  , p=0.3576  , df_denom=36, df_num=1
walmex does not Granger-cause sp500.

sp500 --> walmex
Granger Causality
number of lags (no zero) 1
ssr based F test:         F=10.9697 , p=0.0021  , df_denom=36, df_num=1
ssr based chi2 test:   chi2=11.8839 , p=0.0006  , df=1
likelihood ratio test: chi2=10.3734 , p=0.0013  , df=1
parameter F test:         F=10.9697 , p=0.0021  , df_denom=36, df_num=1
sp500 Granger-causes walmex.
| | y2 | y1 | Granger Causality | Test Interpretation | SSR F-test p-values | SSR Chi2-test p-values | LR-test p-values | Params F-test p-values |
|---|---|---|---|---|---|---|---|---|
| 0 | sp500 | units | False | sp500 does not Granger-cause units. | 0.1045 | 0.0830 | 0.0889 | 0.1045 |
| 1 | units | sp500 | False | units does not Granger-cause sp500. | 0.1382 | 0.1145 | 0.1202 | 0.1382 |
| 2 | walmex | units | False | walmex does not Granger-cause units. | 0.2046 | 0.1787 | 0.1837 | 0.2046 |
| 3 | units | walmex | False | units does not Granger-cause walmex. | 0.2054 | 0.1795 | 0.1844 | 0.2054 |
| 4 | walmex | sp500 | False | walmex does not Granger-cause sp500. | 0.3576 | 0.3320 | 0.3349 | 0.3576 |
| 5 | sp500 | walmex | True | sp500 Granger-causes walmex. | 0.0021 | 0.0006 | 0.0013 | 0.0021 |
Although the results from the Granger causality tests indicated that the null hypothesis could not be rejected for most pairs of series (the only exception being that sp500 Granger-causes walmex), thus weakening the validity of the VARIMA model, it was decided to proceed with the model.
The purpose of the residual analysis is to assess whether the difference between the actual and predicted values of the model is due to randomness or, in other words, that the residuals are normally and independently distributed (Peixeiro, 2022).
To do so, the residual analysis is performed in two steps (Peixeiro, 2022):
# Getting residuals
residuals = varima_model.residuals(TimeSeries.from_dataframe(X_train), retrain=False, start=0).pd_dataframe()
residuals
`start` value `0` corresponding to timestamp `2014-01-01 00:00:00` is before the first predictable/trainable historical forecasting point for series at index: 0. Ignoring `start` for this series and beginning at first trainable/predictable time: 2021-07-01 00:00:00. To hide these warnings, set `show_warnings=False`.
| time | units | sp500 | walmex |
|---|---|---|---|
| 2021-07-01 | 17.695093 | 42.250966 | 2.757533 |
| 2021-10-01 | 16.102259 | 22.015971 | -1.820832 |
However, as the implementation of the VARIMA model in Darts only yielded residuals for the last two observations, the residual analysis could not be performed properly.
Thus, the residuals were only assessed by testing for uncorrelation with the Ljung-Box test (Peixeiro, 2022), using a 95% confidence level:
$$H_{0}: \text{No autocorrelation}$$
$$H_{1}: \text{Autocorrelation}$$
$$\alpha = 0.05$$
# Ljung-Box test of the residuals on 2 lags
cols = X_model.columns.values
for i in range(0, len(cols)):
    print(cols[i])
    print(acorr_ljungbox(residuals.values[:, i], np.arange(1, 3)))
    print("\n")
units
   lb_stat  lb_pvalue
1      2.0   0.157299
2      inf   0.000000

sp500
   lb_stat  lb_pvalue
1      2.0   0.157299
2      inf   0.000000

walmex
   lb_stat  lb_pvalue
1      2.0   0.157299
2      inf   0.000000
For lag 1, the $p\text{-}values$ were above $0.05$ (the lag-2 statistics are degenerate because only two residuals were available). Thus, the null hypothesis cannot be rejected, meaning that no autocorrelation was found in the residuals of the VARIMA model. Thus, in theory, the residuals are independently distributed and the model could be used for forecasting.
Of course, strictly speaking, as the residual analysis could not be properly performed, it is not possible to know whether the model is valid or not. However, for the purposes of the present study, the model was still used for generating predictions.
The model was then used to forecast the features in the dataset, using the length of the testing set.
X_pred_varima_model = varima_model.predict(X_test.shape[0])
X_pred_varima_model = X_pred_varima_model.pd_dataframe()
X_pred_varima_model
| date | units | sp500 | walmex |
|---|---|---|---|
| 2022-01-01 | 3651.938437 | 4750.952164 | 75.167275 |
| 2022-04-01 | 3669.012984 | 4884.226400 | 78.487810 |
| 2022-07-01 | 3694.286642 | 4991.645860 | 81.403402 |
| 2022-10-01 | 3714.224177 | 5082.650012 | 83.158005 |
| 2023-01-01 | 3727.063879 | 5161.802024 | 84.974235 |
| 2023-04-01 | 3740.701532 | 5227.447196 | 86.644681 |
| 2023-07-01 | 3752.511366 | 5282.648505 | 87.837714 |
| 2023-10-01 | 3761.117644 | 5330.036183 | 88.904245 |
The predictions from the VARIMA model were plotted against the time series to visualize their accuracy.
plot_multiple_predictions(train=X_model[:cutoff+1],
test=X_model[cutoff:],
predictions=X_pred_varima_model,
plot_title='VARIMA Model')
plot_individual_predictions(train=X_model[:cutoff + 1],
test=X_model[cutoff:],
predictions=X_pred_varima_model)
In view of the plots above, it can be seen that the VARIMA model was not able to yield accurate predictions. The only economic indicator whose predictions were more or less accurate was units. However, it is important to bear in mind that the COVID pandemic disrupted historical trends not only in Mexico but worldwide, so it was very difficult to expect the VARIMA model to provide accurate predictions for the years 2021-2023 using data from 2014-2020 for training. Notwithstanding the above, it seems that the performance of the VARIMA model was slightly worse than that of the VARMA model.
The RMSE, MAE, and $\bf{r^{2}}$ score were calculated for each feature as follows:
varima_rmse, varima_mae, varima_r2 = calculate_scores(predictions=X_pred_varima_model, actuals=X_model[cutoff:])
units
RMSE: 57.604
MAE: 42.577
Coefficient of Determination: 0.523

sp500
RMSE: 939.278
MAE: 898.338
Coefficient of Determination: -15.601

walmex
RMSE: 14.040
MAE: 12.263
Coefficient of Determination: -26.838
# Adding scores to lists
X_rmse_list.append(varima_rmse)
X_mae_list.append(varima_mae)
X_r2_list.append(varima_r2)
As it was desired to predict a target numeric value, a regression model was built for predicting the net sales of WALMEX based on the forecasts for the $3$ features selected via the multivariate time series models: units, S&P 500, and WALMEX stock value.
It was decided to use a Random Forest (RF) approach, as this algorithm provides better models than individual decision trees and makes it easy to identify the most important features in a model (Géron, 2019). Moreover, it is not necessary to comply with the assumptions of linear regression or to perform feature scaling.
The main assumption is that the net sales of WALMEX can be predicted as a function of the economic indicators of Mexico and USA.
Furthermore, it was assumed that the relationship between the features and the response was not strictly linear; thus, a linear regression approach was discarded.
The dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
selected_features = ['units', 'sp500', 'walmex']
X_model = X[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_model, y, test_size = 0.2, random_state = 0)
The random forest model was built using the RandomForestRegressor class from the scikit-learn library.
rf_model = RandomForestRegressor(n_estimators = 500, random_state = 0)
rf_model.fit(X_train, y_train)
RandomForestRegressor(n_estimators=500, random_state=0)
The model was then used to predict the response variable using the features from the testing set.
y_pred_rf_model = rf_model.predict(X_test)
# Transforming predictions into a Pandas dataframe
y_pred_rf_model = pd.DataFrame(y_pred_rf_model, index=X_test.index, columns=['net_sales'])
y_pred_rf_model
| date | net_sales |
|---|---|
| 2019-07-01 | 167946.916 |
| 2019-01-01 | 164339.552 |
| 2020-04-01 | 167019.450 |
| 2015-01-01 | 119220.126 |
| 2016-07-01 | 126007.214 |
| 2017-10-01 | 144273.058 |
| 2021-01-01 | 186783.454 |
| 2016-10-01 | 124694.170 |
Then, the importance of each feature was retrieved using the Mean Decrease in Impurity criterion:
importances = rf_model.feature_importances_
importances
array([0.67663784, 0.14493053, 0.17843163])
features_names = X_model.columns
importances_series = pd.Series(importances, index=features_names)
chart_title = "Feature Importances in Random Forest Regression Model"
fig, ax = plt.subplots(figsize=(8,5))
importances_series.sort_values(ascending=False).plot.bar(ax=ax)
plt.title(chart_title)
plt.xlabel("Features")
plt.ylabel("Mean Decrease in Impurity")
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
Thus, in view of the plot above, the feature Units was the most important in the model.
The predictions were plotted against the historical net sales data to visually assess the performance of the regression model.
def plot_regression_predictions(predictions, time_series=y, chart_title='Predictions from Regression Model'):
    """
    Plots the predictions from a regression model compared to a historical time series.
    Parameters:
        predictions (pandas.DataFrame): Predictions from the regression model with a DateTime index.
        time_series (pandas.DataFrame or pandas.Series): Historical time series to plot against.
        chart_title (str): Title for the chart.
    Returns:
        None
    """
    # Colors
    time_series_color = sns.color_palette('Blues_r')[0]
    pred_color = '#C70039'
    # If predictions and time_series are DataFrames, select the first column
    if isinstance(predictions, pd.DataFrame):
        predictions = predictions.iloc[:, 0]
    if isinstance(time_series, pd.DataFrame):
        time_series = time_series.iloc[:, 0]
    # Plots
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.lineplot(x=time_series.index, y=time_series.values, ax=ax, color=time_series_color, zorder=1)
    sns.scatterplot(x=predictions.index, y=predictions.values, ax=ax, color=pred_color, zorder=2)
    # Labels
    plt.title(chart_title)
    plt.xlabel("Date")
    plt.ylabel("Net Sales (millions of MXN)")
    # Adding legend to plot
    legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
                    Line2D([0], [0], color=pred_color, lw=2, linestyle='None', marker="o")]
    plt.legend(legend_lines, ['Time Series', 'Predictions'], loc='upper left', facecolor='white', frameon=True)
    # Adjusting Y ticks to currency format
    ticks = ax.get_yticks()
    ax.set_yticks(ticks)  # Fix tick locations before relabeling to avoid a matplotlib warning
    new_labels = [f'${int(i):,.0f}' for i in ticks]
    ax.set_yticklabels(new_labels)
    plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches='tight')
    plt.show()
plot_regression_predictions(predictions=y_pred_rf_model,
chart_title="Predictions from RF Model VS. WALMEX Historical Net Sales")
From the plot above, it can be seen that the predictions are very close to the historical time series. Thus, the Random Forest regression model was able to capture the most important features for predicting the net sales of WALMEX.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
rf_rmse, rf_mae, rf_r2 = calculate_scores(pd.DataFrame(y_test), y_pred_rf_model)
net_sales
RMSE: 16256.433
MAE: 12380.448
Coefficient of Determination: 0.516
# Appending values to scores lists
reg_rmse_list.append(rf_rmse[0])
reg_mae_list.append(rf_mae[0])
reg_r2_list.append(rf_r2[0])
Several numbers of trees were tested, and the following results were obtained:
| Trees | RMSE | MAE | $r^{2}$ |
|---|---|---|---|
| 50 | 16465.794 | 13240.795 | 0.504 |
| 100 | 16464.343 | 12533.214 | 0.513 |
| 500 | 16256.433 | 12380.448 | 0.516 |
Thus, based on their performance, the model selected was the RF model with 500 trees.
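The comparison summarized above can be reproduced with a simple loop over `n_estimators` (a sketch; without the project's actual data the scores will differ from those reported in the table).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def compare_tree_counts(X_train, y_train, X_test, y_test, counts=(50, 100, 500)):
    """Fit one random forest per tree count and tabulate test-set scores."""
    rows = []
    for n in counts:
        rf = RandomForestRegressor(n_estimators=n, random_state=0)
        rf.fit(X_train, y_train)
        pred = rf.predict(X_test)
        rows.append({'Trees': n,
                     'RMSE': np.sqrt(mean_squared_error(y_test, pred)),
                     'MAE': mean_absolute_error(y_test, pred),
                     'r2': r2_score(y_test, pred)})
    return pd.DataFrame(rows)
```

Fixing `random_state` keeps the comparison fair, so any score difference comes from the tree count alone.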
Likewise, another regression model was built for predicting the net sales of WALMEX based on the forecasts for the $3$ features selected via the multivariate time series models: units, S&P 500, and WALMEX stock value.
A Support Vector Regression (SVR) approach was selected, as this algorithm is less restrictive in its assumptions than linear regression, is more robust to outliers, and does not strictly assume a linear relationship between the independent and dependent variables, even when using a linear kernel (Géron, 2019).
The main assumption is that the net sales of WALMEX can be predicted as a function of the economic indicators of Mexico and USA.
Likewise, a strict linear relationship between the variables was not assumed, nor were the linear-regression requirements of normally distributed residuals, no multicollinearity, and homoscedasticity.
Firstly, according to the cointegration test, the data was sliced to select only the features units, S&P 500, and WALMEX stock value.
selected_features = ['units', 'sp500', 'walmex']
X_model = X[selected_features]
Then, the dataset was split into training and testing sets, allocating $80\%$ and $20\%$ of the data, respectively.
X_train, X_test, y_train, y_test = train_test_split(X_model, y, test_size = 0.2, random_state = 0)
Next, the features were scaled using the StandardScaler class.
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
Then, the SVR model was built using the SVR class from the scikit-learn library.
svr_model = SVR(kernel='linear', C=1000.0, epsilon=0.1).fit(X_train_sc, y_train)
The model was then used to predict the response variable using the features from the testing set.
y_pred_svr_model = svr_model.predict(X_test_sc)
# Transforming predictions into a Pandas dataframe
y_pred_svr_model = pd.DataFrame(y_pred_svr_model, index=X_test.index, columns=['net_sales'])
y_pred_svr_model
| date | net_sales |
|---|---|
| 2019-07-01 | 164807.934413 |
| 2019-01-01 | 151221.875855 |
| 2020-04-01 | 163867.176807 |
| 2015-01-01 | 123791.350612 |
| 2016-07-01 | 130167.111614 |
| 2017-10-01 | 141331.989989 |
| 2021-01-01 | 179614.339687 |
| 2016-10-01 | 129726.974562 |
The predictions were plotted against the historical net sales data to visually assess the performance of the regression model.
plot_regression_predictions(predictions=y_pred_svr_model,
chart_title="Predictions from SVR Model VS. WALMEX Historical Net Sales")
From the plot above, it can be seen that the predictions are close to the historical time series. Thus, the SVR model was able to capture the most important features for predicting the net sales of WALMEX.
Then, the RMSE, MAE, and $\bf{r^{2}}$ score were calculated as follows:
svr_rmse, svr_mae, svr_r2 = calculate_scores(pd.DataFrame(y_test), y_pred_svr_model)
net_sales
RMSE: 14535.582
MAE: 10676.184
Coefficient of Determination: 0.402
# Appending values to scores lists
reg_rmse_list.append(svr_rmse[0])
reg_mae_list.append(svr_mae[0])
reg_r2_list.append(svr_r2[0])
Several values for the parameter C were tested, and the following results were obtained:
| C | Epsilon | RMSE | MAE | $r^{2}$ |
|---|---|---|---|---|
| 1.0 | 0.1 | 21540.030 | 15545.984 | -176851.575 |
| 100.0 | 0.1 | 17257.012 | 13353.768 | -10.351 |
| 1000.0 | 0.1 | 14535.582 | 10676.184 | 0.402 |
| 5000.0 | 0.1 | 16414.855 | 11640.130 | 0.470 |
| 10000.0 | 0.1 | 16223.993 | 11624.612 | 0.475 |
| 100000.0 | 0.1 | 18720.282 | 14280.358 | 0.465 |
Thus, based on their performance, the model selected was the SVR model with C=1000.0, which yielded the lowest RMSE and MAE.
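The manual sweep over C above could equivalently be automated with scikit-learn's GridSearchCV (a sketch under the same kernel and epsilon; the notebook tuned C by hand):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# A pipeline ensures the scaler is refit inside each cross-validation fold,
# avoiding leakage from the validation fold into the scaling statistics
pipe = make_pipeline(StandardScaler(), SVR(kernel='linear', epsilon=0.1))
param_grid = {'svr__C': [1.0, 100.0, 1000.0, 5000.0, 10000.0]}
search = GridSearchCV(pipe, param_grid,
                      scoring='neg_root_mean_squared_error', cv=3)
# After search.fit(X_train, y_train), search.best_params_['svr__C']
# holds the C value that minimized the cross-validated RMSE
```

Note that cross-validated selection can pick a different C than a single hold-out split, which is one reason the table above shows no single C dominating all three scores.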
In this section, the research question was answered: Will Walmart of Mexico be able to double its sales within the next ten years? If so, when?
Firstly, the models were ranked according to their performance, and the best univariate, multivariate and regression models were selected and retrained using all the historic data.
Then, the WALMEX net sales forecasts were estimated using the two approaches defined in this study: the best univariate time series model, and the best regression model fed with the forecasts from the best multivariate time series model.
In both cases, the net sales were forecasted over a period of ten years (2024Q1-2033Q4).
According to the assessment performed on each model, the models were ranked in terms of the RMSE, MAE, and $r^{2}$ scores.
# Creating Pandas Dataframe from Score Lists of Univariate Time Series Models
univariate_scores_dict = {'Model': univariate_models_list,'RMSE': rmse_list, 'MAE': mae_list, 'Coeff. of Determination': r2_list}
univariate_scores_df = pd.DataFrame(univariate_scores_dict)
univariate_scores_df
| | Model | RMSE | MAE | Coeff. of Determination |
|---|---|---|---|---|
| 0 | MA | 15216.900678 | 9651.900000 | 0.464547 |
| 1 | AR | 7687.806004 | 6558.916411 | 0.863330 |
| 2 | AR w/Additive Decomp. | 11272.203504 | 8970.403291 | 0.706176 |
| 3 | ARMA | 5006.672720 | 4190.297162 | 0.942035 |
| 4 | ARIMA | 27274.027756 | 24120.852698 | -0.720156 |
| 5 | SARIMA | 2675.576039 | 2372.531418 | 0.983446 |
| 6 | SARIMAX | 21608.095277 | 20415.993569 | -0.079698 |
| 7 | SES | 180107.784220 | 166655.345652 | -74.012589 |
| 8 | HW | 9508.312432 | 7645.737108 | 0.790938 |
| 9 | Prophet | 12323.068089 | 10710.708420 | 0.648839 |
# Best performing univariate time series model in terms of RMSE
univariate_scores_df.loc[univariate_scores_df['RMSE'] == univariate_scores_df['RMSE'].min(), :]
| | Model | RMSE | MAE | Coeff. of Determination |
|---|---|---|---|---|
| 5 | SARIMA | 2675.576039 | 2372.531418 | 0.983446 |
# Best performing univariate time series model in terms of MAE
univariate_scores_df.loc[univariate_scores_df['MAE'] == univariate_scores_df['MAE'].min(), :]
| | Model | RMSE | MAE | Coeff. of Determination |
|---|---|---|---|---|
| 5 | SARIMA | 2675.576039 | 2372.531418 | 0.983446 |
Thus, in view of the results above, the $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model was the univariate time series model with the best performance in terms of the RMSE, MAE, and $r^{2}$ scores.
In the case of the multivariate models, the scores are as follows:
# Creating Pandas Dataframe from Score Lists of Multivariate Time Series Models
multivariate_scores_dict = {'Model': multivariate_models_list,'RMSE': X_rmse_list, 'MAE': X_mae_list, 'Coeff. of Determination': X_r2_list}
multivariate_scores_df = pd.DataFrame(multivariate_scores_dict)
multivariate_scores_df
| | Model | RMSE | MAE | Coeff. of Determination |
|---|---|---|---|---|
| 0 | VAR | [41.68363399538274, 741.6975334422666, 7.84074... | [30.27016894936554, 699.7044134829706, 6.13388... | [0.7500587264511199, -9.351694912545407, -7.68... |
| 1 | VARMA | [29.895275196270124, 806.357115051538, 8.35249... | [26.241590913676305, 763.0903497943536, 7.0409... | [0.8714382007640038, -11.235244127649448, -8.8... |
| 2 | VARIMA | [57.60360726976815, 939.2783902259093, 14.0402... | [42.57743313428142, 898.3375517962176, 12.2627... | [0.522683780550852, -15.601466594952402, -26.8... |
Thus, in view of the results above, the $\text{VARMA}(1, 1)$ model was the multivariate time series model with the best performance in terms of the RMSE, MAE, and $r^{2}$ scores.
Finally, the scores from the regression models are shown below:
# Creating Pandas Dataframe from Score Lists of Regression Models
regression_scores_dict = {'Model': regression_models_list,
'RMSE': reg_rmse_list,
'MAE': reg_mae_list,
'Coeff. of Determination': reg_r2_list}
regression_scores_df = pd.DataFrame(regression_scores_dict)
regression_scores_df
| | Model | RMSE | MAE | Coeff. of Determination |
|---|---|---|---|---|
| 0 | RF | 16256.433470 | 12380.448000 | 0.515834 |
| 1 | SVR | 14535.581695 | 10676.183853 | 0.401998 |
Overall, both regression models have very similar performance. However, since a Random Forest cannot predict values outside the range of its training targets, and forecasting growing net sales requires exactly that kind of extrapolation, the SVR was selected as the best performing regression model.
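The extrapolation limitation mentioned above can be illustrated with a minimal synthetic example (made-up data, not WALMEX figures): a Random Forest trained on an upward trend caps its predictions at the largest target seen during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data with a strictly increasing trend; max target = 2 * 49 = 98
X_train = np.arange(0, 50, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Asking the forest for points beyond the training range: tree predictions are
# averages of training leaves, so they plateau near 98 instead of 120/160/200
X_future = np.array([[60.0], [80.0], [100.0]])
preds = rf.predict(X_future)
```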
The $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model, which was the best performing univariate time series model, was retrained with all the available data to generate the forecasts of the WALMEX net sales over the next ten years.
univariate_model = SARIMA_model(p=[1],
d=1,
q=[1],
P=[1],
D=1,
Q=[1],
m=4,
time_series=y)
The best model is (p = 1, d = 1, q = 1)(P = 1, D = 1, Q = 1)(m = 4), with an AIC of 709.74.
(p,d,q)(P,D,Q)m AIC
0 (1, 1, 1, 1, 1, 1, 4) 709.737995
The predictions were estimated as follows:
forecasted_years = 10
y_pred_univariate_model = univariate_model.get_prediction(start=len(y), end=(len(y)+(forecasted_years*4))).predicted_mean
y_pred_univariate_model = pd.DataFrame(y_pred_univariate_model).rename(columns={'predicted_mean':'net_sales'})
y_pred_univariate_model.index.name = 'date'
y_pred_univariate_model.tail()
| | net_sales |
|---|---|
| date | |
| 2033-01-01 | 375846.239256 |
| 2033-04-01 | 383747.990764 |
| 2033-07-01 | 383201.395711 |
| 2033-10-01 | 424049.659129 |
| 2034-01-01 | 393491.468211 |
Then, the $\text{VARMA}(1,1)$ model, which was the best performing multivariate time series model, was retrained with all the available data to generate the forecasts of the predictors (units, S&P 500, and WALMEX stock value) over the next ten years.
In turn, the above forecasts were used to predict the WALMEX net sales with the SVR model.
X_model = X[['units', 'sp500', 'walmex']]
X_model_diff = X_diff[['units', 'sp500', 'walmex']]
multivariate_model = VARMA_model(p=[1], q=[1], series=X_model_diff)
The best model is (p = 1, q = 1), with an AIC of 1130.01.
(p,q) AIC
0 (1, 1) 1130.007552
Then, the predictions from the VARMA model were obtained.
forecasted_years = 10
X_pred_multivariate_model = multivariate_model.get_prediction(start=0, end=(forecasted_years*4)).predicted_mean.values
# Bringing predictions to original scale
for i in range(0, len(X_model.columns)):
X_pred_multivariate_model[:,i] = X_model.values[-1,i] + X_pred_multivariate_model[:,i].cumsum()
# Converting array of predictions into a dataframe
features_names = X_model.columns
index_labels = y_pred_univariate_model.index
X_pred_multivariate_model = pd.DataFrame(data=X_pred_multivariate_model, index=index_labels, columns=features_names)
X_pred_multivariate_model.tail()
| | units | sp500 | walmex |
|---|---|---|---|
| date | |||
| 2033-01-01 | 5013.311546 | 6873.751990 | 98.957488 |
| 2033-04-01 | 5042.021394 | 7040.376839 | 100.593166 |
| 2033-07-01 | 5069.858295 | 7243.914734 | 102.655605 |
| 2033-10-01 | 5082.205722 | 7364.412426 | 106.121013 |
| 2034-01-01 | 5115.372305 | 7429.510337 | 108.052018 |
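The cumsum step used above to bring the differenced predictions back to the original scale can be verified with a minimal numeric example (made-up numbers): adding the cumulative sum of first differences to the last observed level recovers the level series.

```python
import numpy as np

levels = np.array([10.0, 12.0, 15.0, 19.0])   # original (undifferenced) series
diffs = np.diff(levels)                       # first differences: [2., 3., 4.]
recovered = levels[0] + diffs.cumsum()        # integrates back: [12., 15., 19.]
```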
plot_multiple_predictions(train=X_model,
test=X_model,
predictions=X_pred_multivariate_model,
plot_title='Predictors Forecasts From VARMA Model')
Later, the SVR model was also retrained with all the historical data.
X_sc = sc.transform(X[['units', 'sp500', 'walmex']])
regressor = SVR(kernel='linear', C=1000, epsilon=0.1)
regressor.fit(X_sc, y)
SVR(C=1000, kernel='linear')
Finally, the predictions for the WALMEX net sales were obtained from the regression model when using the forecasts for the predictors from the VARMA model.
y_pred_multivariate_model = regressor.predict(sc.transform(X_pred_multivariate_model))
y_pred_multivariate_model
array([206251.94427811, 208882.66509991, 210690.84116724, 212791.10223392,
216495.10085519, 218276.31331256, 219481.22773762, 221351.32599743,
221858.30564074, 224353.24440028, 227774.69287867, 229968.45027504,
234232.26088527, 237652.34315964, 239058.23108442, 242502.62858522,
245262.6805464 , 248014.97532063, 249600.58095426, 250853.39336929,
252746.32868685, 256356.59550893, 258061.41293436, 260588.72123067,
263182.86719354, 264808.35944326, 267801.66801255, 273429.23789836,
278356.25264569, 281269.45989441, 286172.83336491, 289033.89837456,
290625.45742746, 290189.64514756, 290438.25998121, 291808.95300327,
293072.05409541, 296861.51496584, 301230.56186466, 305038.88432865,
308231.33496768])
# Converting predictions into a dataframe
y_pred_multivariate_model = pd.DataFrame(y_pred_multivariate_model, columns=['net_sales']).set_index(y_pred_univariate_model.index)
y_pred_multivariate_model.head()
| | net_sales |
|---|---|
| date | |
| 2024-01-01 | 206251.944278 |
| 2024-04-01 | 208882.665100 |
| 2024-07-01 | 210690.841167 |
| 2024-10-01 | 212791.102234 |
| 2025-01-01 | 216495.100855 |
Now, after all the modeling and evaluation, the predictions from the univariate and multivariate/regression models were assessed visually:
def plot_sales_forecast(time_series, predictions, chart_title):
"""
Plots a time series and its forecast.
Parameters:
time_series (pandas.dataframe): Historical time series data.
predictions (pandas.dataframe): Forecasts for the time series.
chart_title (str): Title to be displayed on the chart.
Returns:
None
"""
# Colors
time_series_color = sns.color_palette('Blues_r')[0]
pred_color = '#C70039'
# If predictions and time_series are DataFrames, select the first column
if isinstance(predictions, pd.DataFrame):
predictions = predictions.iloc[:, 0]
if isinstance(time_series, pd.DataFrame):
time_series = time_series.iloc[:, 0]
# Adjusting plots continuity
last_value = time_series.iloc[-1]
last_index = time_series.index[-1]
last_obs = pd.Series(last_value, index=[last_index])
predictions = pd.concat([predictions,last_obs]).sort_index()
# Plots
fig, ax = plt.subplots(figsize=(8,5))
sns.lineplot(x=time_series.index, y=time_series.values, ax=ax, color=time_series_color, zorder=1)
sns.lineplot(x=predictions.index, y=predictions.values, ax=ax, color = pred_color, zorder=2)
# Adding shadow to predictions
first_pred_x = predictions.index[0]
last_pred_x = predictions.index[-1]
ax.axvspan(first_pred_x, last_pred_x, color='#808080', alpha=0.2)
# Labels
plt.title(chart_title)
plt.xlabel("Date")
plt.ylabel("Net Sales (mdp)")
# Adding legend to plot
legend_lines = [Line2D([0], [0], color=time_series_color, lw=2),
Line2D([0], [0], color=pred_color, lw=2)]
plt.legend(legend_lines, ['Historical', 'Forecast'], loc='upper left', facecolor='white', frameon=True)
# Adjusting Y ticks to Currency format
ticks = ax.get_yticks()
new_labels = [f'${int(i):,.0f}' for i in ticks]
ax.set_yticklabels(new_labels)
plt.savefig(f'Images/fig_{chart_title.lower().replace(" ", "_")}.png', bbox_inches = 'tight')
plt.show()
# Forecast from the Univariate Model
plot_sales_forecast(time_series=y,
predictions=y_pred_univariate_model,
chart_title='Net Sales Forecast from Univariate Model')
# Forecast from the Multivariate/Regression Model
plot_sales_forecast(time_series=y,
predictions=y_pred_multivariate_model,
chart_title='Net Sales Forecast from Multivariate-Regression Models')
Finally, the net sales from 2023Q3 were compared with the forecasts to determine when WALMEX's goal could be achieved, if at all.
def print_conclusion(predictions, time_series=y, last_date='2023-07-01'):
"""
Looks for the first prediction in which the net sales are doubled with respect to the figure at the input last date, and prints the result.
Parameters:
predictions (pandas.dataframe): Net sales predictions data.
time_series (pandas.dataframe): Historical net sales data.
last_date (str): Date whose net sales figure is used as the baseline.
Returns:
None
"""
last_sales = time_series.loc[last_date]
sales_doubled = predictions.loc[predictions.iloc[:,0] >= 2*last_sales,:]
print(f'WALMEX last figure of net sales was ${float(last_sales):,.2f} (mdp) on 2023Q3.')
try:
# Converting to datetime
date = pd.to_datetime(str(sales_doubled.index[0]))
# Converting to year-quarter format
year_quarter = date.to_period('Q')
# Printing result
print(f'WALMEX will double its net sales in: {str(year_quarter)}, with a forecasted figure of about: ${float(sales_doubled.values[0]):,.2f} (mdp).')
except IndexError:
print('WALMEX will not be able to double its net sales within the next ten years.')
print_conclusion(y_pred_univariate_model)
WALMEX last figure of net sales was $211,436.00 (mdp) on 2023Q3. WALMEX will double its net sales in: 2033Q4, with a forecasted figure of about: $424,049.66 (mdp).
print_conclusion(y_pred_multivariate_model)
WALMEX last figure of net sales was $211,436.00 (mdp) on 2023Q3. WALMEX will not be able to double its net sales within the next ten years.
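As a sanity check on these conclusions, doubling net sales within ten years implies a compound annual growth rate of about 7.2% (roughly 1.7% per quarter):

```python
# Back-of-the-envelope check: growth rates implied by doubling in 10 years
required_cagr = 2 ** (1 / 10) - 1       # annual rate, approx. 0.0718 (7.2% per year)
required_quarterly = 2 ** (1 / 40) - 1  # quarterly rate, approx. 0.0175 (1.7% per quarter)
```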
Thus, in view of the results from this study, according to the $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model, WALMEX will double its net sales in 2033Q4, with a forecasted figure of about $424,049.66 (mdp).
On the other hand, according to the $\text{VARMA}(1,1)$ and SVR models, WALMEX will not be able to double its net sales within the next ten years, even if the trends in its number of units, the S&P 500, and the WALMEX stock value remain constant.
However, the combination of the $\text{VARMA}(1,1)$ and SVR models was not capable of capturing the seasonality of the historical net sales. Thus, it is considered that the predictions from the $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model are more reliable.
According to the above results, the best performing model was $\text{SARIMA}(1,1,1)(1,1,1)_{4}$, which was a univariate time series model fitted from the historical net sales data of WALMEX.
For the deployment, it is proposed to put said model into production using the Streamlit cloud services. Please click here to play with the interactive app.
According to the results from the present study, Walmart of Mexico (WALMEX) will meet its goal of doubling its sales from 211,436 mdp to 424,050 mdp in the fourth quarter of 2033, in about ten years, according to the fitted $\text{SARIMA}(1,1,1)(1,1,1)_{4}$ model.
Moreover, the combination of a VARMA model for forecasting the economic variables and an SVR model for predicting the net sales from those forecasts proved to be unsuccessful, whereas the more traditional approach of a SARIMA model was better able to capture the trend and seasonality of the WALMEX net sales time series.
In this context, as a future perspective, it is suggested to explore how to automate the collection of WALMEX net sales data in order to build a model that can be retrained and yield new predictions each time a new data point is added to the time series.
%pip freeze > requirements.txt
Note: you may need to restart the kernel to use updated packages.